16-825 Learning for 3D Vision • Fall 2025
Name: Haejoon Lee (andrewid: haejoonl)

Assignment 4: 3D Generation with Score Distillation Sampling

Table of Contents

Part 1: 3D Gaussian Splatting

1.1.5 3D Gaussian Rasterization (35 points)

Implemented a 3D Gaussian rasterization pipeline in PyTorch, including projection of 3D Gaussians to 2D, evaluation of 2D Gaussians, filtering and sorting, alpha and transmittance computation, and splatting for color blending.

Implementation Overview

The rasterization pipeline consists of several key steps (a short sketch of the compositing in steps 4 and 5 follows the list):

  1. Project 3D Gaussians to 2D: Compute 2D mean and covariance from 3D Gaussian parameters using camera projection
  2. Evaluate 2D Gaussians: Compute the power (exponent) of 2D Gaussian at each pixel location
  3. Filter and Sort: Filter Gaussians behind the camera and sort by depth
  4. Compute Alphas and Transmittance: Calculate alpha values and transmittance for proper alpha blending
  5. Perform Splatting: Blend colors using alpha and transmittance values
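
A minimal sketch of the compositing in steps 4 and 5, assuming the per-pixel opacities are already depth-sorted front to back (the function and variable names here are illustrative, not the actual implementation):

import torch

def composite(alphas: torch.Tensor, colors: torch.Tensor) -> torch.Tensor:
    """
    alphas: (N, P) opacities of N depth-sorted Gaussians at each of P pixels.
    colors: (N, 3) RGB color of each Gaussian.
    Returns (P, 3) blended pixel colors.
    """
    # Transmittance T_i = prod_{j<i} (1 - alpha_j), with T_0 = 1.
    one_minus_alpha = 1.0 - alphas
    T = torch.cumprod(
        torch.cat([torch.ones_like(alphas[:1]), one_minus_alpha[:-1]], dim=0), dim=0
    )
    # Front-to-back alpha blending: C = sum_i T_i * alpha_i * c_i
    weights = T * alphas          # (N, P)
    return weights.T @ colors     # (P, 3)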

Results

3D Gaussian rendering without spherical harmonics

Rendering of pre-trained 3D Gaussians (view-independent, DC component only)

1.2.2 Training 3D Gaussian Representations (15 points)

Implemented training code to optimize 3D Gaussians from multi-view images and a point cloud initialization. The training includes making parameters trainable, setting up optimizers with different learning rates, and implementing the forward pass with loss computation.

Training Details

Learning Rates Used

  • Means (positions): 0.0001 (smaller learning rate for stable geometry)
  • Opacities: 0.01 (larger learning rate for faster convergence)
  • Colors: 0.01 (larger learning rate for appearance)
  • Scales (for isotropic Gaussians): 0.005 (moderate learning rate)

Number of Iterations: 2000

Loss Function: L1 loss between predicted and ground truth images
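
A minimal sketch of the optimizer setup with these per-parameter learning rates and the L1 objective, assuming the Gaussian parameters have already been made trainable (attribute and function names such as gaussians.means and render are illustrative):

import torch

# One parameter group per attribute so each gets its own learning rate.
optimizer = torch.optim.Adam([
    {"params": [gaussians.means],     "lr": 1e-4},
    {"params": [gaussians.opacities], "lr": 1e-2},
    {"params": [gaussians.colours],   "lr": 1e-2},
    {"params": [gaussians.scales],    "lr": 5e-3},
])

for itr in range(2000):
    pred = render(gaussians, camera)           # rasterize from a training view
    loss = torch.abs(pred - gt_image).mean()   # L1 loss against the ground-truth image
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()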

Training Progress

Training progress

Training Progress: Top row shows predicted renderings, bottom row shows ground truth

Final Results

Final renders

Final Rendered Views After Training

Performance Metrics

PSNR: 29.334

SSIM: 0.939

1.3.1 Rendering Using Spherical Harmonics (10 points)

Extended the rasterizer to support view-dependent rendering using spherical harmonics. This enables rendering of pre-trained 3D Gaussians with full spherical harmonic coefficients, capturing view-dependent effects like reflections and specular highlights.

Implementation

The key addition is the colours_from_spherical_harmonics function, which evaluates spherical harmonic basis functions given viewing directions and combines them with the learned SH coefficients to produce view-dependent colors.
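
A minimal sketch of the band-0/band-1 part of such an evaluation, assuming the real spherical harmonic basis and sign convention used in the 3D Gaussian Splatting codebase (the actual colours_from_spherical_harmonics also handles higher-order bands):

import torch

C0 = 0.28209479177387814   # Y_0^0 constant: 1 / (2*sqrt(pi))
C1 = 0.4886025119029199    # band-1 constant: sqrt(3) / (2*sqrt(pi))

def sh_to_colour_deg1(sh: torch.Tensor, dirs: torch.Tensor) -> torch.Tensor:
    """
    sh:   (N, 4, 3) SH coefficients per Gaussian (DC + three band-1 terms, RGB).
    dirs: (N, 3) unit viewing directions from the camera to each Gaussian.
    Returns (N, 3) view-dependent colours.
    """
    x, y, z = dirs[:, 0:1], dirs[:, 1:2], dirs[:, 2:3]
    colour = C0 * sh[:, 0]                     # view-independent DC component
    colour = colour - C1 * y * sh[:, 1] + C1 * z * sh[:, 2] - C1 * x * sh[:, 3]
    return (colour + 0.5).clamp(min=0.0)       # shift and clamp to a valid colour range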

Comparison: With vs. Without Spherical Harmonics

Without Spherical Harmonics (Q1.1.5)

Without SH

View-independent rendering (DC component only)

With Spherical Harmonics (Q1.3.1)

With SH

View-dependent rendering (full SH coefficients)

Observations

  • View Independence vs. View Dependence: Without SH, the rendering appears flat and lacks view-dependent effects. With SH, the rendering shows realistic view-dependent effects like reflections and specular highlights.
  • Visual Quality: The SH rendering captures more realistic material properties, especially for shiny or reflective surfaces.
  • Color Variation: With SH, colors change smoothly as the viewing angle changes, creating a more photorealistic appearance.

Part 2: Diffusion-guided Optimization

2.1 SDS Loss + Image Optimization (20 points)

Implemented the Score Distillation Sampling (SDS) loss with and without classifier-free guidance. The SDS loss uses a pre-trained Stable Diffusion model to guide image optimization toward matching a text prompt.

Implementation Details

# SDS Loss Implementation (from SDS.py)
def sds_loss(self, latents, text_embeddings, text_embeddings_uncond=None,
             guidance_scale=100, grad_scale=1):
    # Sample random timestep
    t = torch.randint(self.min_step, self.max_step + 1, (latents.shape[0],),
                      dtype=torch.long, device=self.device)

    # Add noise to latents
    with torch.no_grad():
        noise = torch.randn_like(latents)
        latents_noisy = self.scheduler.add_noise(latents, noise, t)

        # Predict noise
        noise_pred = self.unet(latents_noisy, t,
                               encoder_hidden_states=text_embeddings).sample

        # Classifier-free guidance
        if text_embeddings_uncond is not None and guidance_scale != 1:
            noise_pred_uncond = self.unet(latents_noisy, t,
                                          encoder_hidden_states=text_embeddings_uncond).sample
            noise_pred = noise_pred_uncond + guidance_scale * (noise_pred - noise_pred_uncond)

    # Compute SDS gradient
    w = 1 - self.alphas[t]
    grad = w[:, None, None, None] * (noise_pred.detach() - noise) * grad_scale

    # Target trick: MSE between current and target latents
    loss = 0.5 * F.mse_loss(latents, (latents - grad).detach(), reduction="sum") / latents.shape[0]
    return loss
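
A hedged sketch of how this loss drives the Q2.1 image optimization, assuming latents of shape (1, 4, 64, 64) are optimized directly with AdamW at learning rate 0.1 (as noted in the observations below); the decode step mirrors the VAE usage shown in the pixel-space variant later:

import torch

# Directly optimize a latent image toward the text prompt.
latents = torch.randn(1, 4, 64, 64, device=sds.device, requires_grad=True)
optimizer = torch.optim.AdamW([latents], lr=0.1)

for itr in range(2000):
    loss = sds.sds_loss(latents,
                        text_embeddings=embeddings['cond'],
                        text_embeddings_uncond=embeddings['uncond'],  # None disables guidance
                        guidance_scale=100)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Decode the optimized latents back to an RGB image.
with torch.no_grad():
    image = sds.vae.decode(latents / sds.vae.config.scaling_factor).sample
    image = (image / 2 + 0.5).clamp(0, 1)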

Results for Four Different Prompts

1. "a hamburger"

Without Guidance
Hamburger without guidance
Iterations: 2000
With Guidance
Hamburger with guidance
Iterations: 2000

2. "a standing corgi dog"

Corgi with guidance

With Guidance - Iterations: 2000

3. "a cat lying on a rooftop"

Cat with guidance

With Guidance - Iterations: 2000

4. "a pigeon on an wire"

Pigeon with guidance

With Guidance - Iterations: 2000

Observations

  • Without Guidance (hamburger only): The image is abstract and lacks detail. The optimization converges faster but produces lower-quality results.
  • With Guidance: Images show significantly better fidelity and detail. Classifier-free guidance (guidance_scale=100) amplifies the conditional signal, resulting in more accurate prompt matching.
  • Training: All images were optimized for 2000 iterations with a learning rate of 0.1 using the AdamW optimizer.

2.2 Texture Map Optimization for Mesh (15 points)

Optimized the texture map of a fixed-geometry cow mesh using SDS loss. The texture is represented by a ColorField neural network that maps vertex coordinates to RGB colors.

Implementation Details

# Mesh Texture Optimization (from Q22_mesh_optimization.py)
# Initialize texture field
color_field = ColorField().to(device)

# Create mesh with learnable texture
mesh = pytorch3d.structures.Meshes(
    verts=vertices,
    faces=faces,
    textures=TexturesVertex(verts_features=color_field(vertices))
)

# Training loop: randomly sample camera and render
for iteration in range(total_iter):
    # Random camera pose
    dist = torch.rand(1).item() * (dist_max - dist_min) + dist_min
    elev = torch.rand(1).item() * (elev_max - elev_min) + elev_min
    azim = torch.rand(1).item() * 360
    R, T = look_at_view_transform(dist=dist, elev=elev, azim=azim)
    cameras = FoVPerspectiveCameras(device=device, R=R, T=T)

    # Render mesh
    rend = renderer(mesh, cameras=cameras, lights=lights)
    rend = rend[0, ..., :3]  # Extract RGB

    # Compute SDS loss
    latents = sds.encode_imgs(rend)
    loss = sds.sds_loss(latents,
                        text_embeddings=embeddings['cond'],
                        text_embeddings_uncond=embeddings['uncond'],
                        guidance_scale=100, grad_scale=1)
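
The ColorField used above is not shown in the snippet; a minimal sketch, assuming a small coordinate MLP with a sigmoid output (layer sizes are illustrative):

import torch
import torch.nn as nn

class ColorField(nn.Module):
    """Maps 3D vertex coordinates to RGB colours in [0, 1]."""
    def __init__(self, hidden_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 3), nn.Sigmoid(),   # keep colours in [0, 1]
        )

    def forward(self, verts: torch.Tensor) -> torch.Tensor:
        # verts: (V, 3) vertex positions -> (V, 3) RGB colour per vertex
        return self.mlp(verts)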

Results for Two Different Text Prompts

1. "a cow with rainbow stripes"

Rainbow striped cow

360° rotation of textured mesh

2. "a dotted black and white cow"

Dotted black and white cow

360° rotation of textured mesh

Observations

  • Texture Quality: The SDS loss successfully applies diverse textures (stripes, dots) to the mesh geometry.
  • View Consistency: The texture appears consistent across different viewing angles, demonstrating successful optimization.
  • Training: Trained for 2000 iterations with random camera sampling for diverse viewpoints.

2.3 NeRF Optimization (15 points)

Optimized a NeRF model to generate 3D objects from text prompts using SDS loss. Both geometry and appearance are learnable, unlike the fixed-geometry mesh in Q2.2.

Implementation Details

# NeRF Optimization (from Q23_nerf_optimization.py)
# Render NeRF from random camera pose
pred_rgb, pred_depth = renderer.render(rays_o, rays_d, staged=False,
                                       perturb=True, bg_color=bg_color)

# Encode to latent space
pred_rgb_512 = F.interpolate(pred_rgb.permute(2, 0, 1).unsqueeze(0),
                             size=(512, 512), mode='bilinear', align_corners=False)
latents = sds.encode_imgs(pred_rgb_512)

# Compute SDS loss
loss = sds.sds_loss(latents,
                    text_embeddings=embeddings['default'],
                    text_embeddings_uncond=embeddings['uncond'],
                    guidance_scale=100, grad_scale=1)

# Regularization losses
loss_entropy = entropy_loss(weights, pred_depth)
loss_orient = orientation_loss(normals, rays_d)
loss_total = loss + lambda_entropy * loss_entropy + lambda_orient * loss_orient
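
The entropy_loss and orientation_loss terms referenced above are not shown; a hedged sketch of common formulations in the DreamFusion / stable-dreamfusion style (the actual implementation may differ in details such as how pred_depth is used):

import torch

def entropy_loss(weights: torch.Tensor, pred_depth=None) -> torch.Tensor:
    """Binary-entropy penalty on ray opacities: pushes density toward empty or solid.
    pred_depth is accepted to mirror the call above but is unused in this sketch."""
    alpha = weights.clamp(1e-5, 1.0 - 1e-5)
    return (-alpha * torch.log2(alpha) - (1.0 - alpha) * torch.log2(1.0 - alpha)).mean()

def orientation_loss(normals: torch.Tensor, rays_d: torch.Tensor) -> torch.Tensor:
    """Penalizes surface normals that face away from the camera, as in DreamFusion."""
    # A positive dot product between a normal and the ray direction means the
    # normal points away from the viewer.
    return ((normals * rays_d).sum(dim=-1).clamp(min=0.0) ** 2).mean()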

Hyperparameters Tuned

  • lambda_entropy: 1e-3 (encourages sparse density, prevents cloudy geometry)
  • lambda_orient: 1e-2 (encourages normals to face camera, improves geometry)
  • latent_iter_ratio: 0.2 (first 20% of training uses normal shading for geometry warmup)

Results for Three Example Prompts

1. "a standing corgi dog"

Corgi RGB

RGB Rendering

Corgi Depth

Depth Map

2. "a drumming gorilla"

Gorilla RGB

RGB Rendering

Gorilla Depth

Depth Map

3. "a rabbit eating a carrot"

Rabbit RGB

RGB Rendering

Rabbit Depth

Depth Map

Observations

  • Geometry Quality: The NeRF successfully learns overall 3D geometry matching the text prompts, with recognizable shapes and structures. However, the geometry is not perfectly consistent across different views (e.g., the corgi has multiple front faces and three ears).
  • Appearance: Colors and textures are reasonably matched to the prompts, though not photorealistic.
  • Depth Maps: Show clear geometric structure and depth variation, confirming successful 3D learning.
  • Training: Trained for 10000 iterations with progressive shading (normal → textureless/lambertian).

2.4.1 View-Dependent Text Embedding (10 points)

Extended Q2.3 with view-dependent text embeddings to improve 3D consistency. Different text embeddings are used based on the camera viewing angle (front, side, back, overhead, bottom).

Implementation Details

# View-Dependent Text Embedding (from Q23_nerf_optimization.py)
if args.view_dep_text:
    azimuth_val = azimuth.item()
    abs_polar = 90 + polar.item()  # absolute polar angle in degrees
    angle_overhead = 30  # degrees

    if abs_polar <= angle_overhead:  # Overhead view
        text_cond = embeddings['overhead']
    elif abs_polar >= (180 - angle_overhead):  # Bottom view
        text_cond = embeddings['bottom']
    else:  # Normal view: interpolate front/side/back based on azimuth
        if azimuth_val >= -90 and azimuth_val < 90:
            if azimuth_val >= 0:
                r = 1 - azimuth_val / 90
            else:
                r = 1 + azimuth_val / 90
            text_cond = r * embeddings['front'] + (1 - r) * embeddings['side']
        else:
            if azimuth_val >= 0:
                r = 1 - (azimuth_val - 90) / 90
            else:
                r = 1 + (azimuth_val + 90) / 90
            text_cond = r * embeddings['side'] + (1 - r) * embeddings['back']
else:
    text_cond = embeddings['default']
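
A hedged sketch of how the per-view embeddings consumed above could be built, assuming the base prompt is simply augmented with a view phrase before text encoding (the helper name encode_text is a placeholder, not the actual SDS.py API):

# Build one text embedding per canonical view by appending a view phrase to the
# base prompt, following DreamFusion's view-dependent conditioning.
view_suffixes = {
    'front':    ', front view',
    'side':     ', side view',
    'back':     ', back view',
    'overhead': ', overhead view',
    'bottom':   ', bottom view',
}
embeddings = {
    'default': encode_text(prompt),   # encode_text stands in for the text encoder call
    'uncond':  encode_text(""),
}
for view, suffix in view_suffixes.items():
    embeddings[view] = encode_text(prompt + suffix)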

Results for Two Example Prompts

1. "a standing corgi dog"

Q2.3 (Baseline)
Corgi RGB baseline

RGB

Corgi depth baseline

Depth

Q2.4.1 (View-Dependent)
Corgi RGB view-dep

RGB

Corgi depth view-dep

Depth

2. "a rabbit eating a carrot"

Q2.3 (Baseline)
Rabbit RGB baseline

RGB

Rabbit depth baseline

Depth

Q2.4.1 (View-Dependent)
Rabbit RGB view-dep

RGB

Rabbit depth view-dep

Depth

Qualitative Analysis: Effects of View-Dependent Text Conditioning

Comparison with Q2.3 (Baseline):

  • 3D Consistency: View-dependent text embeddings help reduce multi-view inconsistencies. The baseline (Q2.3) sometimes shows artifacts like "two mouths" or "three ears" because each view is optimized independently. View-dependent conditioning provides view-specific guidance that encourages consistent geometry.
  • Geometry Quality: The view-dependent approach maintains similar or slightly improved geometric quality. The depth maps show consistent structure across views.
  • Appearance: RGB renderings show similar appearance quality, with view-dependent embeddings providing more context-aware optimization.
  • Limitations: While view-dependent embeddings help, some 3D inconsistencies may still persist due to the fundamental challenge of optimizing 3D structure from 2D diffusion guidance. The effectiveness depends on how well the diffusion model understands view-specific descriptions.

Key Insight: View-dependent text conditioning follows DreamFusion's hierarchical approach (Section 3.2.3), where different text embeddings are used for different viewing angles. This provides more targeted guidance during optimization, reducing the "Janus problem" (multiple front faces) and improving overall 3D consistency.

2.4.3 Variation of Implementation of SDS Loss (10 points)

Implemented SDS loss in pixel space instead of latent space, computing gradients directly in RGB space for potentially better perceptual alignment.

Implementation: Pixel-Space SDS Loss

# Pixel-Space SDS Loss (from SDS.py)
def sds_loss_pixel(self, latents, text_embeddings, text_embeddings_uncond=None,
                   guidance_scale=100, grad_scale=1):
    # Sample timestep and add noise (same as latent-space)
    t = torch.randint(self.min_step, self.max_step + 1, (latents.shape[0],),
                      dtype=torch.long, device=self.device)
    with torch.no_grad():
        noise = torch.randn_like(latents)
        latents_noisy = self.scheduler.add_noise(latents, noise, t)
        noise_pred = self.unet(latents_noisy, t,
                               encoder_hidden_states=text_embeddings).sample

        # Classifier-free guidance
        if text_embeddings_uncond is not None and guidance_scale != 1:
            noise_pred_uncond = self.unet(latents_noisy, t,
                                          encoder_hidden_states=text_embeddings_uncond).sample
            noise_pred = noise_pred_uncond + guidance_scale * (noise_pred - noise_pred_uncond)

    # Compute SDS gradient in latent space
    alpha_t = self.alphas[t]
    w = 1 - alpha_t
    grad_latent = w[:, None, None, None] * (noise_pred.detach() - noise) * grad_scale

    # Decode current latents to pixel space
    latents_scaled = 1 / self.vae.config.scaling_factor * latents
    imgs_orig = self.vae.decode(latents_scaled.type(self.precision_t)).sample
    imgs_orig = (imgs_orig / 2 + 0.5).clamp(0, 1)

    # Decode target latents (latents - grad) to pixel space
    with torch.no_grad():
        latents_target = latents - grad_latent
        latents_target_scaled = 1 / self.vae.config.scaling_factor * latents_target
        imgs_target = self.vae.decode(latents_target_scaled.type(self.precision_t)).sample
        imgs_target = (imgs_target / 2 + 0.5).clamp(0, 1)

    # Compute loss in pixel space using target trick
    loss = 0.5 * F.mse_loss(imgs_orig, imgs_target.detach(), reduction='sum') / latents.shape[0]
    return loss

Key Implementation Details

  • Gradient Computation: SDS gradient is computed in latent space (same noise prediction and guidance as latent-space SDS)
  • Loss Computation: Both current and target latents are decoded to pixel space, and loss is computed in RGB space
  • Gradient Flow: Gradients flow back through the VAE decoder to the NeRF model, providing pixel-space supervision
  • Target Trick: Uses the same target trick formulation but in pixel space: loss = MSE(decode(latents), decode(latents - grad))

Results

Pixel-space SDS result

Result from pixel-space SDS loss: "a standing poodle dog"

Why the Implementation Failed

Problem: The pixel-space SDS loss implementation failed to produce any 3D geometry. The rendered output shows only a colored background with no recognizable 3D shape.

Root Cause Analysis:

  • Gradient Flow Issue: The critical problem is in how gradients flow back through the VAE decoder. In the current implementation, the target image (imgs_target) is computed under torch.no_grad(), which breaks the gradient flow. While this is intentional for the target trick (to prevent gradients from flowing through the target), the issue is that the loss computation may not be properly connected to the NeRF parameters.
  • VAE Decoder Precision: The VAE decoder uses self.precision_t for type conversion, which may cause precision issues when gradients flow back. The decoder might not be fully differentiable in the way needed for this application.
  • Loss Scale: The pixel-space loss may have a different scale compared to latent-space loss, requiring different learning rates or loss scaling that wasn't tuned.
  • Numerical Instability: Decoding latents to pixels and then computing MSE may introduce numerical instabilities, especially when the latents are far from the VAE's training distribution.
  • Missing Regularization: The pixel-space implementation may need additional regularization or different hyperparameters compared to latent-space, but these weren't tuned.

What Should Have Been Done:

  • Ensure Gradient Flow: Make sure that imgs_orig has gradients enabled and flows back through the VAE decoder to the latents, and then to the NeRF model (a small diagnostic sketch follows this list).
  • Tune Hyperparameters: The pixel-space loss likely requires different learning rates, loss scales, or regularization weights compared to latent-space.
  • Stability Checks: Add gradient clipping or normalization to prevent numerical instabilities from the VAE decoder.
  • Progressive Training: Start with latent-space loss to establish geometry, then fine-tune with pixel-space loss.
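
As a concrete check of the first point, a small diagnostic sketch that could confirm whether pixel-space SDS gradients actually reach the NeRF parameters (names follow the snippets above; model stands in for the NeRF module):

# Run one pixel-space SDS step and inspect the gradients on the NeRF parameters.
loss = sds.sds_loss_pixel(latents,
                          text_embeddings=embeddings['default'],
                          text_embeddings_uncond=embeddings['uncond'],
                          guidance_scale=100)
loss.backward()

grads = [p.grad.abs().max().item() for p in model.parameters() if p.grad is not None]
# An empty list or near-zero values would indicate that the VAE decode or the
# precision cast is breaking or attenuating the gradient flow.
print(len(grads), max(grads) if grads else None)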

Comparison with Latent-Space:

The latent-space SDS loss works because it operates directly in the VAE's latent space, which is designed for the diffusion model. The gradients are more stable and the loss scale is well-calibrated. In contrast, pixel-space loss introduces an additional decoding step that may not preserve the gradient information needed for effective optimization.