Final Project (liac): Language-Driven Human Pose Animation

In this project, we explore the use of pose as an intermediary for generating human pose animations driven by text input. Specifically, we investigate the effectiveness of combining OpenPose pose detection and Text2Video-Zero to generate more accurate and realistic character motion from text. We also show some experiments using pose generators (the Human Motion Diffusion Model and Motion Diffuse), as well as preliminary experiments using 3D models to get greater consistency. While our results suggest that there is still much room for improvement, we observe that this approach can improve on the non-pose approach in producing realistic and substantive character motion, while maintaining temporal consistency and character identity.

Existing Work

Recent visual generation systems aim to generate images and videos based on textual descriptions, with some notable examples of such systems being Text2Live, Runway's Gen1, ControlNet, and Instruct-pix2pix.

Runway’s Gen1, with an input video provided

Instruct pix2pix, performing image edits

However, these systems have some drawbacks that limit their effectiveness in generating high-quality videos and animations. For example, most of these systems rely on video-to-video translation, which means that they require a driving video to specify the action or motion in the output video. This can be limiting when a driving video is not available or is laborious to collect, or when the desired output does not match any existing real video, which may be the case for some expressive animations.

Furthermore, these systems typically operate on edits at the image level, which means that they are unable to generate novel motion or action in the output video. This limitation makes it challenging to generate complex scenes or actions that require a sequence of motions to capture, such as a person running and shooting a hoop in basketball.

Another limitation of existing AI visual generation systems is that the outputs are often temporally inconsistent, with the main character or subject not having the same visual identity across scenes. Furthermore, there is often significant frame-to-frame flicker. These temporal inconsistencies can lead to a jarring viewing experience for the audience, and severely restrict the practical usability of these systems for animation and production-level video.

The idea behind this project was to use pose guidance to supply the motion of the character, in the hope of getting both more motion and better temporal consistency.

No pose, vs. with pose guidance

Here is a comparison of Text2Video-Zero with no pose guidance, versus with pose guidance. I ran it on the prompt: “A {person, panda, astronaut} waving hello.”

With no pose guidance:

With pose guidance:

As you can see, with pose guidance, the final results have more temporal consistency and more actual motion of the main character. As such, adding pose is a promising direction to go in.

OpenPose + Text2Video-Zero with pose guidance

I took clips from the AMASS Dataset of 3D Human Motions and ran OpenPose pose detection on them. The input pose guidance is shown in the left column. The output video results are shown for the following prompt structure:

a {person, panda, astronaut} + action + {on Mars, in the forest, in the street}

where action is walking, dancing, jumping, etc.
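The prompt structure above can be enumerated programmatically. The following is a minimal sketch, not the script used for these experiments: the function name `build_prompts` is hypothetical, the action list contains only the three actions named explicitly in the text, and the choice of "a" versus "an" is an added convenience.

```python
from itertools import product

# Values copied from the prompt template in the text. The action list is
# limited to the three actions named explicitly ("walking, dancing, jumping").
SUBJECTS = ["person", "panda", "astronaut"]
ACTIONS = ["walking", "dancing", "jumping"]
SETTINGS = ["on Mars", "in the forest", "in the street"]


def build_prompts(subjects, actions, settings):
    """Return one prompt per (subject, action, setting) combination.

    Hypothetical helper for illustration only.
    """
    prompts = []
    for subject, action, setting in product(subjects, actions, settings):
        # Pick "an" before a vowel so "astronaut" reads naturally.
        article = "an" if subject[0].lower() in "aeiou" else "a"
        prompts.append(f"{article} {subject} {action} {setting}")
    return prompts


prompts = build_prompts(SUBJECTS, ACTIONS, SETTINGS)
```

Each resulting string would then be paired with one pose video and passed to the pose-guided generator.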

I ran it on cases where the pose of the main character is not centered (i.e., the character walks across the scene), and it performs very poorly: it doesn’t pick up on the main character at all.

From the CMU Graphics Lab Motion Capture Database

The Text2Video-Zero with pose guidance result

Thus, I focused on input poses where the main character is centered and show the results below.
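The text does not define "centered" precisely. One possible way to screen clips, sketched below under stated assumptions, is to check that the mean horizontal keypoint position stays near the middle of the frame in every frame. The function name `is_centered`, the keypoint layout (a list of (x, y) pixel tuples per frame), and the 0.15 tolerance are all hypothetical choices, not part of the original pipeline.

```python
def is_centered(frames, frame_width, tolerance=0.15):
    """Return True if the figure stays near the horizontal centre of the frame.

    frames: list of frames, each a non-empty list of (x, y) keypoints in pixels.
    tolerance: allowed offset from the centre, as a fraction of frame width
               (an assumed, untuned value).
    """
    for keypoints in frames:
        mean_x = sum(x for x, _ in keypoints) / len(keypoints)
        # Offset of the figure from the frame centre, in units of frame width.
        if abs(mean_x / frame_width - 0.5) > tolerance:
            return False
    return True


# A figure that stays in place versus one that walks across a 512-pixel frame.
static_clip = [[(250, 100), (262, 300)] for _ in range(3)]
walking_clip = [[(50 + 100 * t, 100), (60 + 100 * t, 300)] for t in range(5)]
```

With these toy inputs, the static clip passes the check and the walking clip does not.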

Jumping

It does pretty well on jumping.

Dancing

Also does well on this static dance pose.

Stumbling away

It does surprisingly well for this case, given that it didn’t do so well for walking (next one).

Walking: failure case

When it comes to walking, the model mixes up the legs, and the astronaut’s head faces the wrong way at times. It seems to get confused when limbs cross each other, such as when the legs overlap.

Spinning: failure case

In this case, the model completely lost the main figure and didn’t follow the pose at all. This movement might be too complicated, perhaps because of how fast the character is spinning and/or because the limbs are constantly crossing over each other during the spin.

Other Experiments

I also tried using two different pose generators, the Human Motion Diffusion Model (MDM) and Motion Diffuse. However, since the Text2Video-Zero model was trained on OpenPose-format pose videos rather than these pose schemas, the results were suboptimal. Converting these formats to the OpenPose format, to produce better results with pose guidance, is left as future work.
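As a rough illustration of what such a conversion might involve, the sketch below reorders joints by name from one skeleton layout to another. It rests on two assumptions that should be checked against the actual generator output: that the source skeleton uses the first 22 SMPL body joints in their usual order, and that the target is the 18-keypoint COCO-style OpenPose body layout. The helper `to_openpose` and the choice to use the head joint as a stand-in for the nose are hypothetical.

```python
# Assumed source layout: first 22 SMPL body joints (verify for your generator).
SMPL_22 = [
    "pelvis", "left_hip", "right_hip", "spine1", "left_knee", "right_knee",
    "spine2", "left_ankle", "right_ankle", "spine3", "left_foot", "right_foot",
    "neck", "left_collar", "right_collar", "head", "left_shoulder",
    "right_shoulder", "left_elbow", "right_elbow", "left_wrist", "right_wrist",
]

# Assumed target layout: 18-keypoint COCO-style OpenPose body skeleton.
OPENPOSE_18 = [
    "nose", "neck", "right_shoulder", "right_elbow", "right_wrist",
    "left_shoulder", "left_elbow", "left_wrist", "right_hip", "right_knee",
    "right_ankle", "left_hip", "left_knee", "left_ankle",
    "right_eye", "left_eye", "right_ear", "left_ear",
]

# Target joints taken from a differently named source joint.
# Using "head" for the nose is an approximation.
RENAMES = {"nose": "head"}


def to_openpose(joints):
    """Reorder one frame of 22 source joints into 18 OpenPose slots.

    joints: list of 22 coordinate tuples in SMPL_22 order.
    Returns 18 entries; slots with no source joint (eyes, ears) are None.
    """
    if len(joints) != len(SMPL_22):
        raise ValueError(f"expected {len(SMPL_22)} joints, got {len(joints)}")
    by_name = dict(zip(SMPL_22, joints))
    return [by_name.get(RENAMES.get(name, name)) for name in OPENPOSE_18]


# Toy frame: joint i sits at x = i, which makes the reordering easy to inspect.
frame = [(float(i), 0.0) for i in range(22)]
openpose_frame = to_openpose(frame)
```

This covers only the joint reordering. Projecting 3D joints to 2D and rendering them in OpenPose's color-coded skeleton style would still be needed before the result resembles the model's training input.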

Human Motion Diffusion Model

On the left is the generated pose from MDM for the prompt “a person got down and is crawling across the floor”. I then removed the background to make it more suitable for Text2Video-Zero to consume (right).

Below is the result of Text2Video-Zero with pose guidance on the prompt “a {person, cat} crawling {on the floor, across the grass}”. As you can see, it doesn’t do so well, probably because the pose input is not in the same format as the OpenPose videos the model was trained on.

Motion Diffuse

Here is the same process, but for the Motion Diffuse pose generator rather than MDM. On the “astronaut jumping in space” prompt, Motion Diffuse did pretty well (better than the previous MDM crawling example). However, I still think I need to feed Text2Video-Zero the appropriate OpenPose input to get the best results.

Going to 3D

I also exported the results of the MDM pose generator into an SMPL format, and imported 120 obj files (1 for each frame) into Blender to capture these animations. I am pleased with the consistency that you obviously get when you use a 3D model, and would like to find a middle ground between the artistic/creative stylization you get with Text2Video-Zero in the 2D world, and the physical consistency you get with a 3D model. Overall, going to 3D seems like a promising direction.

Conclusion

In this project, we investigated the effectiveness of using pose as an intermediary for generating human pose animations driven by text input. We combined OpenPose pose detection with Text2Video-Zero and observed significant improvements in the quality of character motion for tasks such as jumping and holding a static dance pose. However, we also found that the approach still struggled with tasks that involved limbs crossing the body, like walking or spinning.

Looking ahead, there are still several avenues for future work. We plan to explore ways to convert pose schemas from different pose generators, such as the MDM and Motion Diffuse, into a format that Text2Video-Zero can understand and perform reasonably well on. Additionally, moving towards the 3D model world seems like a promising direction for improving the realism and accuracy of character motion in our pipeline.

Overall, this project demonstrates the potential of using pose as an intermediary for language-driven human pose animation and provides a starting point for future research in this area.