Generative Refinement of Single-View to 3D Animatable Human

Jianjin Xu

Abstract

Producing high-quality, animatable 3D human avatars from just a single image is both a difficult and significant goal for applications in virtual reality, gaming, and digital media. Existing solutions are generally divided into lifting, regressive, and generative models, each offering unique strengths and weaknesses. In this paper, we introduce a generative refinement pipeline that combines the benefits of all three approaches. The process starts with a regressive model (LHM) that quickly generates a rough 3D Gaussian representation, which is then improved using a pretrained video diffusion model to enhance unseen areas and fix misalignments. Our results show that this refinement step greatly enhances visual quality by reducing blur and adding more detailed features to side and rear views, all while preserving structural integrity. Additionally, we present a synthetic data generation pipeline designed to create a varied and controllable dataset for training a dedicated generative model.


Introduction

Virtual human avatars are essential in areas like virtual reality, gaming, and film production. In VR, users interact with digital environments through their 3D avatars, while in games, these avatars serve as the main interface for player interaction. The level of realism and detail in avatars significantly affects the sense of immersion. To advance avatar quality and reduce development costs, researchers have proposed data-driven methods for generating 3D human avatars from text, single or multiple images, and video.

This study targets the generation of animatable 3D human avatars from a single image. The single-image approach makes it easy to collect data, as anyone can simply take a picture of themselves. However, this method is challenging because the model only receives a front-facing view, with little inherent 3D information. Current solutions generally fall into three categories: lifting models, regressive models, and generative models.

Lifting models are among the most widely used methods. They often rely on multi-view diffusion or video models to synthesize additional views from the input image. These synthesized views are then processed by multi-view reconstruction or feed-forward models to produce a coherent 3D representation. For instance, MVTokenFlow leverages pre-trained diffusion models to transform a single image into a four-dimensional animatable avatar. Pippo, MagicMan, and other works employ similar strategies using human-specific multi-view diffusion models.

Regressive models, on the other hand, predict 3D representations from input images in a largely deterministic fashion. Notable examples include PIFu, which estimates 3D occupancy fields, Human-LRM, which predicts triplanes, and LHM, which outputs 3D Gaussian blobs. Because regressive models do not model distributions, they often struggle to accurately represent unseen regions.

Generative models are designed to learn the distribution of 3D representations and can produce random samples from this distribution. RODIN  trains 3D diffusion models using triplanes derived from extensive synthetic datasets. Although RODIN can generate plausible completions for unseen regions, it is mostly limited to synthetic data and cannot produce highly realistic textures. This highlights that generative models are constrained by the diversity and quality of their training data, which is costly to expand for 3D human tasks. Recently, SIGMAN  compiled a dataset of one million Gaussians representing 110,000 identities from public 3D and multi-view datasets. However, their model, which has not yet been released, demands significant computational resources—about 3500 GPU hours—to train.

This project seeks to answer a central question: Is it possible to combine the advantages of all three model types? Regressive models are efficient for both training and inference, providing a quick way to obtain coarse 3D shapes. Lifting models, whether as practical tools or pre-trained systems, can enhance these coarse models through post-processing. Generative models, while capable of producing high-quality 3D objects, are resource-intensive to scale and train. By using a regressive model for an initial coarse estimate, a generative model layered on top could be much smaller than one trained from scratch. Furthermore, lifting models can further improve the generative model’s results via post-processing or data augmentation.

Building on these ideas, we introduce a generative refinement pipeline for animatable 3D human avatar creation from a single image. As illustrated in Figure 1, the process involves three main steps. First, a regressive model (LHM) generates a coarse 3D Gaussian representation from a single image. Second, this coarse representation is refined using a pre-trained video diffusion model (Wan2.1) to yield a more accurate 3D model. Finally, we assemble a diverse synthetic dataset of 3D human avatars by generating reference images with FLUX and applying the first two steps to create their 3D models. This dataset can then be used to train a generative model, for which we have some initial results, though the training is still ongoing.

Related Work

Lifting-based 3D Avatar Generation

Lifting-based methods use auxiliary models to generate multiple views or video sequences from a single image, which are then used to reconstruct 3D representations. MVTokenFlow  combines multi-view and video diffusion models to transform single images into 4D animatable avatars, maintaining viewpoint consistency through token flow techniques. Similarly, Pippo  and MagicMan  develop multi-view diffusion models tailored for humans to create several views before assembling 3D representations. PSHuman  also follows this approach but focuses on preserving subject-specific details. Techniques like Cap4D  and FaceLift  apply similar strategies to portraits. While these approaches achieve strong results, they often depend on complex, multi-stage pipelines where errors can accumulate.

Regressive-based 3D Avatar Generation

Regressive methods predict 3D representations directly from single images using deterministic mappings. PIFu was an early example, learning implicit functions for 3D occupancy field prediction from single images. Subsequent works, such as Human-LRM, regress triplanes for the human body, and LHM predicts 3D Gaussian blobs from a single image. These models are efficient for both training and inference, but often have trouble generating detailed results for unseen regions since they do not model the distribution of plausible completions.

Generative 3D Avatar Models

Generative models are designed to learn the distribution of 3D human representations and can sample from this distribution. RODIN  and its high-definition variant RODIN-HD  use 3D diffusion models trained on triplanes from synthetic datasets. These models can generate plausible completions for unseen regions, but their performance is limited by the diversity and quality of training data, which is often synthetic and lacks realistic textures. SIGMAN  sought to overcome this by collecting a dataset of one million 3D Gaussians representing 110,000 identities from public 3D and multi-view datasets. However, such generative models usually require significant computational resources, with SIGMAN training reportedly consuming around 3500 GPU hours.

Text-to-3D Human Generation

Recent developments in text-to-3D human generation have introduced valuable techniques. Methods like DreamHuman, HumanGaussian, AvatarVerse, and Tech Report demonstrate the ability to create detailed human avatars from text. These approaches typically use large language models and diffusion models to convert text into 3D representations, often by generating intermediate 2D images first.

Video-based Avatar Creation

Although this work focuses on single-image methods, there have also been advances in monocular video-based techniques. InstantAvatar  and ExAvatar  leverage temporal information from monocular videos to build consistent 3D human representations. These approaches benefit from additional temporal data but require more input than single-image methods.

Animatable 3D Human Representations

To enable animation, 3D avatars must be represented in a way that supports deformation according to pose parameters. Traditional approaches often use parametric models like SMPL, which provide a low-dimensional skeleton for animation but lack personalized detail. More recent work has explored neural radiance fields (NeRF) and 3D Gaussian Splatting, adapted for human animation, to better balance animation flexibility with the preservation of identity-specific features.

Method

This section presents the three main stages of our generative refinement pipeline. The first stage applies a regressive model to produce a coarse 3D avatar from a single input image. The second stage refines this coarse avatar with a pre-trained video diffusion model to yield a more accurate 3D model. The final stage applies the full pipeline to generate a synthetic dataset of 3D human avatars. Each stage is detailed in the corresponding subsection below.


Regressive 3D Avatar Generation

The pipeline begins with a regressive model that generates a rough 3D avatar from a single image in a single forward pass. For this, we use LHM, a leading model that predicts 3D Gaussians anchored to the SMPL mesh. LHM builds on two essential components: the 3D Gaussian Splatting (3DGS) representation and the SMPL parametric body model.

SMPL.

In simple terms, SMPL is a parametric model that describes the human body as a mesh composed of vertices and faces. The locations of these vertices are controlled by PCA basis coefficients, which serve as the parameters of the SMPL model. Two primary sets of parameters are used: shape parameters β ∈ ℝ^20, which define the body shape, and pose parameters θ ∈ ℝ^(55×3), which specify the rotations of the body joints. In LHM, these SMPL parameters come from Multi-HMR, an off-the-shelf tool that estimates SMPL parameters from monocular video. Removing the temporal constraints from Multi-HMR allows its application to single images.
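To make the role of these parameters concrete, the following is a minimal numpy sketch of how a SMPL-style model turns shape and pose parameters into mesh vertices. The array sizes and the simplified skinning step are illustrative assumptions, not the actual SMPL(-X) assets or LHM's implementation.

```python
import numpy as np

# Minimal sketch of a SMPL-style parametric body, matching the parameter sizes
# quoted above (20 shape coefficients, 55 joints). The zero-filled arrays are
# placeholders; the real model loads a template mesh, PCA shape directions,
# a joint regressor, and skinning weights from the SMPL(-X) asset files.
N_VERTS, N_SHAPE, N_JOINTS = 10475, 20, 55     # placeholder sizes

v_template = np.zeros((N_VERTS, 3))            # mean body mesh
shape_dirs = np.zeros((N_VERTS, 3, N_SHAPE))   # PCA shape blend shapes
joint_regressor = np.zeros((N_JOINTS, N_VERTS))
skin_weights = np.zeros((N_VERTS, N_JOINTS))   # per-vertex LBS weights

def smpl_forward(beta, pose_rotmats):
    """beta: (N_SHAPE,) shape coefficients; pose_rotmats: (N_JOINTS, 3, 3) joint rotations."""
    # 1) Shape: displace the template along the PCA shape directions.
    v_shaped = v_template + shape_dirs @ beta              # (N_VERTS, 3)
    # 2) Joints: regress rest-pose joint locations from the shaped mesh.
    joints = joint_regressor @ v_shaped                    # (N_JOINTS, 3)
    # 3) Pose: blend per-joint rotations with the skinning weights (simplified LBS;
    #    a full implementation composes rotations along the kinematic tree).
    v_posed = np.zeros_like(v_shaped)
    for j in range(N_JOINTS):
        rotated = (v_shaped - joints[j]) @ pose_rotmats[j].T + joints[j]
        v_posed += skin_weights[:, [j]] * rotated
    return v_posed

# Example call: rest pose (identity rotations) with a small random shape.
verts = smpl_forward(0.1 * np.random.randn(N_SHAPE),
                     np.tile(np.eye(3), (N_JOINTS, 1, 1)))
```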

3DGS.

3D Gaussian Splatting (3DGS) is a 3D representation and rendering method introduced by Kerbl et al. In this approach, a 3D object is modeled as a collection of 3D Gaussian blobs. Each blob is defined by a centroid p ∈ ℝ^3, scale σ ∈ ℝ^3, quaternion rotation r ∈ ℝ^4, opacity ρ ∈ [0, 1], and appearance features f ∈ ℝ^C using spherical harmonics for view-dependent effects. For rendering, these Gaussians are projected onto 2D screen space as splats, which are composited in depth order to form the final image. In LHM, each Gaussian is linked to an anchor vertex on the (upsampled) SMPL mesh, enabling animation through the SMPL pose parameters. The position of each Gaussian is expressed as an offset from its anchor vertex. When the anchor vertex is animated using the Linear Blend Skinning (LBS) algorithm, the offset is rotated and added to the animated vertex position.
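The anchoring and offset-rotation scheme described above can be sketched as follows; the field layout and the per-vertex rotation input are illustrative assumptions and do not reproduce LHM's exact code.

```python
import numpy as np
from dataclasses import dataclass

# Sketch of SMPL-anchored Gaussians as described above. Field names and the
# per-vertex rotation input are illustrative; LHM's actual data layout may differ.
@dataclass
class AnchoredGaussians:
    anchor_idx: np.ndarray   # (N,)   index of the SMPL vertex each blob is tied to
    offset: np.ndarray       # (N, 3) centroid offset from the anchor vertex
    scale: np.ndarray        # (N, 3)
    rotation: np.ndarray     # (N, 4) quaternion
    opacity: np.ndarray      # (N,)   values in [0, 1]
    features: np.ndarray     # (N, C) spherical-harmonics appearance coefficients

def animate_centroids(g: AnchoredGaussians,
                      posed_vertices: np.ndarray,      # (V, 3) LBS-posed SMPL vertices
                      vertex_rotations: np.ndarray):   # (V, 3, 3) per-vertex LBS rotations
    """Rotate each offset by its anchor's LBS rotation, then add the posed anchor position."""
    R = vertex_rotations[g.anchor_idx]                        # (N, 3, 3)
    rotated_offset = np.einsum('nij,nj->ni', R, g.offset)     # (N, 3)
    return posed_vertices[g.anchor_idx] + rotated_offset      # animated centroids (N, 3)
```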

LHM uses a transformer architecture conditioned on the reference image to predict the parameters of 3D Gaussians anchored to the predefined SMPL mesh vertices. Training is primarily performed by minimizing the photometric loss between the rendered result and the ground truth image, which may be a different view or video frame. The LHM-1B model was trained on 64 A800 GPUs for 189 hours, representing a significant computational investment for most research groups.

Analysis of LHM predictions.

To enhance the LHM-generated 3D Gaussians (referred to as the coarse LHM 3DGS), we first examine their typical shortcomings. Our qualitative analysis points to two main problems. First, as illustrated in Figure 2, there is noticeable misalignment between the coarse LHM 3DGS and the input image, even when viewed from the same perspective. Second, as seen in Figure 3, the side and back views of the LHM reconstruction exhibit considerable blurriness.

We attribute the misalignment to a combination of inaccurate SMPL parameter estimation and the limited representational power of the LHM model. The blurriness is likely a result of the regressive model’s tendency to average its predictions when uncertain. Because the back and side views are not well constrained by a single front image, the model outputs blurry, averaged results for these regions.
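This averaging behavior has a standard formal justification, stated here for intuition: under an L2 photometric objective, the optimal deterministic prediction for an unobserved region is the mean over all appearances consistent with the input, and averaging many plausible textures produces blur. Writing x_front for the observed view and x_back for an unseen view,

x̂_back = argmin_y E_{x_back ∼ p(x_back | x_front)} ‖x_back − y‖^2 = E[x_back | x_front],

so a purely regressive model is pushed toward the blurry conditional mean rather than any single sharp sample.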

Refinement by Video Diffusion Model

Despite these limitations, the coarse LHM 3DGS serves as an effective starting point for our generative refinement pipeline. Both the misalignment and blurriness can be improved through post-processing, particularly when leveraging multi-view images created by video or multi-view diffusion models.

Large pre-trained video diffusion models such as Wan2.1 are highly effective at regularizing outputs. By employing methods like SDEdit, we can convert out-of-domain data, such as LHM reconstructions with artifacts, into in-domain examples while preserving the core content. The workflow, depicted in Figure 1, proceeds as follows. First, the coarse LHM 3DGS is rendered into a turntable video using a sampled camera trajectory. This trajectory is based on the SV3D dynamic orbit design, with elevation angles generated from random sinusoidal functions and azimuth angles spaced uniformly. Next, noise is added to the video according to an editing strength parameter, and the video is then denoised using a standard flow-matching Euler solver. The optimal editing strength is selected via a linear sweep on a validation set. The denoised video serves as the ground truth for 3DGS fitting. To maintain a one-to-one mapping between the refined and coarse LHM 3DGS, we simplify the original fitting process by turning off densification and pruning, which is particularly useful for training the generative model.
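The noising-and-denoising step can be summarized with the following sketch of an SDEdit-style pass under a rectified-flow convention (x_t = (1 − t)·x_0 + t·noise, t ∈ [0, 1]); `velocity_model` stands in for the pre-trained video diffusion backbone and is an assumption here, not Wan2.1's actual interface.

```python
import torch

# Hedged sketch of SDEdit-style editing under a rectified-flow convention
# (x_t = (1 - t) * x_0 + t * noise, t in [0, 1]). `velocity_model` is a
# placeholder for the pre-trained video diffusion backbone.
@torch.no_grad()
def sdedit_refine(render_latents: torch.Tensor,  # (B, C, T, H, W) encoded turntable render
                  velocity_model,                # callable (x_t, t) -> predicted velocity
                  edit_strength: float = 0.8,    # how far toward pure noise to push the render
                  num_steps: int = 30):
    # 1) Partially noise the rendered video up to t = edit_strength.
    t0 = torch.tensor(edit_strength)
    noise = torch.randn_like(render_latents)
    x = (1.0 - t0) * render_latents + t0 * noise

    # 2) Integrate the flow ODE back to t = 0 with a plain Euler solver.
    timesteps = torch.linspace(edit_strength, 0.0, num_steps + 1)
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        v = velocity_model(x, t_cur)       # predicted velocity at the current time
        x = x + (t_next - t_cur) * v       # Euler step (t decreases toward 0)
    return x                               # edited video latents, used as the fitting target
```

The edited video is then used as the supervision signal for the simplified 3DGS fitting with densification and pruning disabled.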

Generating Synthetic Dataset

Obtaining real-world datasets for 3D human avatars is costly and frequently results in attribute imbalances. For instance, SHHQ  predominantly features Western identities, while MVHumanNet  and DNARendering  mainly include Asian subjects. To address this, we employ advanced text-to-image diffusion models such as FLUX  to generate realistic and diverse human images with customizable attributes.

The process for generating the synthetic dataset is illustrated in step 2 of Figure 1. We first compiled a comprehensive list of human attributes with assistance from ChatGPT, as outlined in Table 1. Prompts are then sampled to ensure a balanced distribution of these attributes. For each attribute, common options are provided, and where appropriate, "none" or "any" options are included to allow either omission or random selection by the diffusion model. The sampled attributes are passed to a local Llama3.3-70B model to generate natural language prompts, which are then used as input for FLUX to create human images.

Table 1: Complete list of the attributes used to generate the synthetic dataset.
Group: Attributes
Basic Identity: age, gender, ethnicity
Physical Features: hair type, hair color, hair style, facial details, body shape
Motion: pose, activity
Emotion: facial expression, emotion
Setting / Environment: urban, natural, indoor, outdoor
Others: occupation, lighting, art style
Clothing: upper body clothing, lower body clothing, shoes, hat
Accessories: glasses, jewelry, carried items
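As a concrete illustration of the sampling scheme, the snippet below draws one option per attribute and assembles the request sent to the prompt-writing LLM. The option lists are abbreviated stand-ins for the full lists in Table 1, and the request wording is hypothetical.

```python
import random

# Illustrative sketch of the balanced attribute sampling described above.
# "any" defers the choice to the diffusion model; "none" omits the attribute.
ATTRIBUTES = {
    "age": ["child", "young adult", "middle-aged", "elderly"],
    "gender": ["female", "male", "non-binary"],
    "hair style": ["short", "long", "braided", "any"],
    "upper body clothing": ["t-shirt", "hoodie", "suit jacket", "any"],
    "hat": ["baseball cap", "beanie", "none"],
    "carried items": ["backpack", "handbag", "none"],
}

def sample_attributes(rng=random):
    """Sample one option per attribute, uniformly, to keep the dataset balanced."""
    picked = {}
    for name, options in ATTRIBUTES.items():
        choice = rng.choice(options)
        if choice == "none":
            continue                      # omit this attribute entirely
        picked[name] = choice             # "any" is kept and left to the model
    return picked

def attributes_to_llm_request(picked):
    """Build the instruction sent to the LLM (e.g., Llama3.3-70B) that writes the prompt."""
    desc = ", ".join(f"{k}: {v}" for k, v in picked.items())
    return ("Write a single natural-language text-to-image prompt describing a "
            f"full-body photo of a person with these attributes: {desc}.")
```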

Generative Refinement Training

This ongoing work is summarized in step 3 of Figure 1. Results are not yet available, as this part of the research is still in progress within a broader project.


Experiments

Evaluation of the Refinement Pipeline

Since the training of the generative model in step 3 was not completed in time, we present qualitative results from steps 1 and 2.

Data and Implementation.

To assess our refinement pipeline, we employed the pre-trained LHM-1B model to create coarse 3D Gaussians from single images. The refinement utilized Wan2.1 as the video diffusion model, applying an editing strength of 0.8, which was selected through a linear sweep to best balance content preservation and detail enhancement. For sampling camera trajectories, we used the dynamic orbit method from SV3D, with elevation angles generated from sinusoidal functions with randomly chosen frequencies, amplitudes, and phases.
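A minimal sketch of the dynamic-orbit sampling is given below; the 81-frame default and the amplitude, frequency, and phase ranges are illustrative assumptions rather than the exact values used.

```python
import numpy as np

# Sketch of SV3D-style dynamic-orbit camera sampling: uniform azimuth sweep
# plus a randomly parameterized sinusoidal elevation profile.
def sample_orbit(num_frames: int = 81, rng: np.random.Generator = None):
    rng = rng or np.random.default_rng()
    azimuth = np.linspace(0.0, 360.0, num_frames, endpoint=False)  # uniform sweep (degrees)

    amplitude = rng.uniform(5.0, 20.0)        # elevation variation in degrees (assumed range)
    frequency = rng.integers(1, 4)            # full sine cycles over one orbit (assumed range)
    phase = rng.uniform(0.0, 2.0 * np.pi)
    elevation = amplitude * np.sin(np.deg2rad(azimuth) * frequency + phase)

    return azimuth, elevation                 # per-frame camera angles in degrees
```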

The test set consisted of custom-captured images of various individuals to evaluate the pipeline’s performance on out-of-domain data.

Qualitative Results.

Figure 4 highlights clear improvements in the visual quality of the refined 3DGS over the original coarse LHM 3DGS. Specifically, the refined results show sharper detail in the side and back views, reduced blurring, and closer alignment with the input image.

Overall, the refinement proves effective and can be broadly applied to coarse LHM 3DGS, serving as an add-on enhancement to the LHM model.

Synthetic Dataset Generation

ChatGPT was used to help compile a list of human image attributes, Llama3.3-70B  generated natural language prompts, and the FLUX-dev  model produced human images based on the sampled attributes. Figure 5 displays some generated examples. The variety and quality of these images illustrate the potential of our method for building large-scale datasets to train generative 3D avatar models.

Previous Failed Attempts

The initial plan, titled “Taming 3D Generative Model for 4D Generation,” aimed to use Hunyuan3D as the generative model for producing 3D human avatars with dynamic shapes. However, upon reviewing Hunyuan3D’s outputs, we observed that the predicted 3D models lacked proper pixel alignment. As depicted in Figure 6, the normal maps reveal that the predicted 3D models do not align well with the input images. In the second and third rows, the head poses differ greatly from the input, whereas the first row shows somewhat better alignment. This indicates varying alignment quality, with misalignments that are not simply rigid deformations and are thus difficult to fix. Consequently, we switched from Hunyuan3D to LHM as the backbone model for 3D prediction.


Conclusion

This work presented a generative refinement pipeline for producing high-quality, animatable 3D human avatars from a single image. The pipeline is built around three main components: (1) a regressive model (LHM) that quickly produces coarse 3D Gaussians, (2) a refinement stage using a video diffusion model to improve the 3D output, and (3) a synthetic data generation process for training specialized generative models.

Our experiments show that the refinement method greatly enhances the quality of 3D avatars, especially in regions that are not visible to regressive models. The improved 3D Gaussians feature clearer details, less blurring, and better correspondence with the input image. Furthermore, the synthetic data generation approach demonstrates potential for creating diverse training datasets beyond the constraints of real-world data.