16-726 Learning-Based Image Synthesis, 2021 Spring

Multi-Pose to Body Translation

Maneesh Bilalpur(mbilalpu), Rawal Khirodkar(rkhirodk), Teddy Zhang(wentaiz)


Overview

Most work in the pose-to-body translation space deals with single-person activity. For example, Wang et al. 2018, Chan et al. 2019, and Balakrishnan et al. 2018 demonstrated that single-person pose maps can be converted into dance videos.

Related works for Single Pose to Body translation.

While these works produce stunningly good synthesis outputs, the lack of experiments on how well GANs handle multi-person pose synthesis is concerning. Unlike the single-person setting, multi-person interactions introduce interesting occlusions caused not only by the camera viewpoint but also by the interactions between people.

In this project we propose a two-stage GAN approach to address the interaction problem in pose-to-body synthesis for the multi-person setting. A sample multi-pose to body output is presented below:

Results of translating the input pose maps to output RGB images at epoch-10.

Challenges

In addition to the challenge of collecting dance videos with no camera motion and a static background, we encountered the difficulties described below.

Manual Annotations

We collected 10 couple-dance videos from YouTube, each about 3 minutes long. From each video we extracted about 3000 frames and used 2000 frames for training and 1000 frames for testing. We found that pose estimators like OpenPose and HRNet fail under human-human interaction. Here are some failure cases:




Failure cases of OpenPose under human-human interaction.
To address these OpenPose failures, we manually annotate the 18 keypoints using the COCO annotator and correct keypoints that are assigned to the wrong person.

Manual annotations overlaid with OpenPose predictions. Notice that OpenPose connects the girl's left arm to the boy's left arm.
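To illustrate how the corrected annotations become generator inputs, here is a minimal sketch (our own illustration, not the exact project code) that rasterizes one person's 18 keypoints into a skeleton image with OpenCV. The keypoint ordering and limb connectivity assume the OpenPose/COCO-18 convention, and the colors are arbitrary.

```python
import cv2
import numpy as np

# Assumed OpenPose/COCO-18 limb connectivity (keypoint-index pairs):
# 0 nose, 1 neck, 2-4 right arm, 5-7 left arm, 8-10 right leg, 11-13 left leg, 14-17 face.
LIMBS = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7),
         (1, 8), (8, 9), (9, 10), (1, 11), (11, 12), (12, 13),
         (0, 14), (14, 16), (0, 15), (15, 17)]

def render_pose_map(keypoints, height, width):
    """Draw one person's skeleton onto a blank canvas.

    keypoints: (18, 3) array of (x, y, visibility); visibility 0 means unlabeled.
    """
    canvas = np.zeros((height, width, 3), dtype=np.uint8)
    for a, b in LIMBS:
        xa, ya, va = keypoints[a]
        xb, yb, vb = keypoints[b]
        if va > 0 and vb > 0:
            cv2.line(canvas, (int(xa), int(ya)), (int(xb), int(yb)), (0, 255, 0), 3)
    for x, y, v in keypoints:
        if v > 0:
            cv2.circle(canvas, (int(x), int(y)), 4, (0, 0, 255), -1)
    return canvas

# Example usage with a (hypothetical) annotation exported from the COCO annotator:
# pose = np.array(annotation["keypoints"], dtype=np.float32).reshape(18, 3)
# cv2.imwrite("pose_map.png", render_pose_map(pose, 1080, 1920))
```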


Architecture

We propose a two-stage model to capture human interactions in multi-person pose maps. The stages include (i) local generators that translate each dancer's pose map individually and (ii) a global generator that models the interaction and synthesizes the full multi-person frame.

End-to-end components in the proposed model
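As a rough illustration of this dataflow (our own simplification, not the released implementation), the PyTorch sketch below uses one local generator per dancer and a global generator that fuses the local renderings with the combined pose map; the placeholder `ConvGenerator` module and the concatenation-based fusion are assumptions for clarity.

```python
import torch
import torch.nn as nn

class ConvGenerator(nn.Module):
    """Placeholder image-to-image generator (stands in for a pix2pixHD generator)."""
    def __init__(self, in_channels, out_channels=3, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, out_channels, 3, padding=1), nn.Tanh(),
        )
    def forward(self, x):
        return self.net(x)

class TwoStagePoseToBody(nn.Module):
    """Stage 1: one local generator per dancer. Stage 2: a global generator
    fuses the local renderings with the combined multi-person pose map."""
    def __init__(self, num_people=2):
        super().__init__()
        self.locals = nn.ModuleList([ConvGenerator(in_channels=3) for _ in range(num_people)])
        # Global input: combined pose map + one RGB rendering per person.
        self.global_gen = ConvGenerator(in_channels=3 + 3 * num_people)

    def forward(self, person_pose_maps, combined_pose_map):
        local_outputs = [g(p) for g, p in zip(self.locals, person_pose_maps)]
        fused_input = torch.cat([combined_pose_map] + local_outputs, dim=1)
        return self.global_gen(fused_input), local_outputs

# Dataflow check with dummy tensors (two dancers, 256x256 pose maps).
model = TwoStagePoseToBody(num_people=2)
poses = [torch.randn(1, 3, 256, 256) for _ in range(2)]
frame, _ = model(poses, torch.randn(1, 3, 256, 256))
print(frame.shape)  # torch.Size([1, 3, 256, 256])
```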

Implementation details

We discard the temporal dynamics of the video frames because of the large compute resources needed to even run the baseline experiments of methods like Wang et al. 2018. Hence, we explore only frame-level synthesis.

We use the pix2pixHD model from NVIDIA's open-source code to implement our local generators. The code for our generator and other preprocessing is released here.

We first train the local generators on occlusion-free frames and then incorporate the global generator. When training the global generator, we initialize from the best local-generator checkpoints and train with all the frames (including occluded ones) from the given video.
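As a rough sketch of this schedule (the optimizer, the loss, the checkpoint paths, the dataloader, and the choice to keep the local generators frozen are all our assumptions, not the released code), the second phase could look like this, reusing the `TwoStagePoseToBody` module from the architecture sketch above:

```python
import torch

# Phase 1 result: best local-generator checkpoints trained on occlusion-free frames.
model = TwoStagePoseToBody(num_people=2)  # defined in the architecture sketch above
for i, ckpt in enumerate(["local_gen_male.pth", "local_gen_female.pth"]):  # hypothetical paths
    model.locals[i].load_state_dict(torch.load(ckpt))
    model.locals[i].requires_grad_(False)  # assumption: local stages stay fixed in phase 2

# Phase 2: train only the global generator on all frames, occluded ones included.
optimizer = torch.optim.Adam(model.global_gen.parameters(), lr=2e-4, betas=(0.5, 0.999))
for person_poses, combined_pose, target in all_frames_loader:  # hypothetical dataloader
    fake_frame, _ = model(person_poses, combined_pose)
    loss = torch.nn.functional.l1_loss(fake_frame, target)  # stand-in for the GAN + feature losses
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```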

We used four NVIDIA RTX 2080 Ti GPUs to train the models.


Results

Baseline

We use the pix2pixHD implementation as our baseline and present results below on one of the two videos with the most occlusion. Throughout the rest of the work, all our critical analysis is based on this video. Our baseline simply takes the entire pose image and maps it to the RGB image in one step.


Input Multi-Pose

Baseline (Single Stage Translation)

Local Generator Examples

We present some results of pose-to-body translation for extreme poses with no occlusion.

Input
Synthesis Images
Groundtruth


Proposed Model output

In addition, we present the output from the proposed model below. While both look largely similar, the major differences are in background synthesis and in the interaction when the dancers swap positions. Perceptually, we found that the baseline has more inconsistent artifacts than the proposed method. For example, the edge artifacts in the sky have a color closer to the sky in the proposed model's output than in the baseline's.

Please refresh the page to time-sync the GIFs, and make sure all of them are visible in the window.

Input Multi-Pose

Baseline (Single Stage Translation)

Local Generator Output for Male Dancer

Local Generator Output for Female Dancer

Our proposed two-stage approach. Note that it handles interaction better than the baseline and shows no artifacts in the background generation.

Training Progress

We show qualitatively how the training of the global generator progresses through various epochs. We train for a total of 50 epochs.

Contributions

Critical analyses and Future Work


Acknowledgements

All the videos used for this project are obtained from YouTube.

We recommend viewing this website on a 21-inch or larger display for accessibility and a better viewing experience.