16-825 Learning for 3D Vision • Fall 2025
Name: Haejoon Lee (andrewid: haejoonl)

Assignment 4: 3D Generation with Score Distillation Sampling

Table of Contents

Part 1: 3D Gaussian Splatting

1.1.5 3D Gaussian Rasterization (35 points)

Implemented a 3D Gaussian rasterization pipeline in PyTorch, including projection of 3D Gaussians to 2D, evaluation of 2D Gaussians, filtering and sorting, alpha and transmittance computation, and splatting for color blending.

Implementation Overview

The rasterization pipeline consists of several key steps (a short sketch of the compositing in steps 4 and 5 follows the list):

  1. Project 3D Gaussians to 2D: Compute 2D mean and covariance from 3D Gaussian parameters using camera projection
  2. Evaluate 2D Gaussians: Compute the power (exponent) of 2D Gaussian at each pixel location
  3. Filter and Sort: Filter Gaussians behind the camera and sort by depth
  4. Compute Alphas and Transmittance: Calculate alpha values and transmittance for proper alpha blending
  5. Perform Splatting: Blend colors using alpha and transmittance values
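
A minimal sketch of the compositing in steps 4 and 5, assuming the per-pixel opacities are already depth-sorted front to back (the function and variable names here are illustrative, not the actual implementation):

import torch

def composite(alphas: torch.Tensor, colors: torch.Tensor) -> torch.Tensor:
    """
    alphas: (N, P) opacities of N depth-sorted Gaussians at each of P pixels.
    colors: (N, 3) RGB color of each Gaussian.
    Returns (P, 3) blended pixel colors.
    """
    # Transmittance T_i = prod_{j<i} (1 - alpha_j), with T_0 = 1.
    one_minus_alpha = 1.0 - alphas
    T = torch.cumprod(
        torch.cat([torch.ones_like(alphas[:1]), one_minus_alpha[:-1]], dim=0), dim=0
    )
    # Front-to-back alpha blending: C = sum_i T_i * alpha_i * c_i
    weights = T * alphas          # (N, P)
    return weights.T @ colors     # (P, 3)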

Results

3D Gaussian rendering without spherical harmonics

Rendering of pre-trained 3D Gaussians (view-independent, DC component only)

1.2.2 Training 3D Gaussian Representations (15 points)

Implemented training code to optimize 3D Gaussians from multi-view images and a point cloud initialization. The training includes making parameters trainable, setting up optimizers with different learning rates, and implementing the forward pass with loss computation.

Training Details

Learning Rates Used

  • Means (positions): 0.0001 (smaller learning rate for stable geometry)
  • Opacities: 0.01 (larger learning rate for faster convergence)
  • Colors: 0.01 (larger learning rate for appearance)
  • Scales (for isotropic Gaussians): 0.005 (moderate learning rate)

Number of Iterations: 2000

Loss Function: L1 loss between predicted and ground truth images
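
A minimal sketch of the optimizer setup with these per-parameter learning rates and the L1 objective, assuming the Gaussian parameters have already been made trainable (attribute and function names such as gaussians.means and render are illustrative):

import torch

# One parameter group per attribute so each gets its own learning rate.
optimizer = torch.optim.Adam([
    {"params": [gaussians.means],     "lr": 1e-4},
    {"params": [gaussians.opacities], "lr": 1e-2},
    {"params": [gaussians.colours],   "lr": 1e-2},
    {"params": [gaussians.scales],    "lr": 5e-3},
])

for itr in range(2000):
    pred = render(gaussians, camera)           # rasterize from a training view
    loss = torch.abs(pred - gt_image).mean()   # L1 loss against the ground-truth image
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()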

Training Progress

Training progress

Training Progress: Top row shows predicted renderings, bottom row shows ground truth

Final Results

Final renders

Final Rendered Views After Training

Performance Metrics

PSNR: 29.334

SSIM: 0.939

1.3.1 Rendering Using Spherical Harmonics (10 points)

Extended the rasterizer to support view-dependent rendering using spherical harmonics. This enables rendering of pre-trained 3D Gaussians with full spherical harmonic coefficients, capturing view-dependent effects like reflections and specular highlights.

Implementation

The key addition is the colours_from_spherical_harmonics function, which evaluates spherical harmonic basis functions given viewing directions and combines them with the learned SH coefficients to produce view-dependent colors.
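
A minimal sketch of the band-0/band-1 part of such an evaluation, assuming the real spherical harmonic basis and sign convention used in the 3D Gaussian Splatting codebase (the actual colours_from_spherical_harmonics also handles higher-order bands):

import torch

C0 = 0.28209479177387814   # Y_0^0 constant: 1 / (2*sqrt(pi))
C1 = 0.4886025119029199    # band-1 constant: sqrt(3) / (2*sqrt(pi))

def sh_to_colour_deg1(sh: torch.Tensor, dirs: torch.Tensor) -> torch.Tensor:
    """
    sh:   (N, 4, 3) SH coefficients per Gaussian (DC + three band-1 terms, RGB).
    dirs: (N, 3) unit viewing directions from the camera to each Gaussian.
    Returns (N, 3) view-dependent colours.
    """
    x, y, z = dirs[:, 0:1], dirs[:, 1:2], dirs[:, 2:3]
    colour = C0 * sh[:, 0]                     # view-independent DC component
    colour = colour - C1 * y * sh[:, 1] + C1 * z * sh[:, 2] - C1 * x * sh[:, 3]
    return (colour + 0.5).clamp(min=0.0)       # shift and clamp to a valid colour range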

Comparison: With vs. Without Spherical Harmonics

Without Spherical Harmonics (Q1.1.5)

Without SH

View-independent rendering (DC component only)

With Spherical Harmonics (Q1.3.1)

With SH

View-dependent rendering (full SH coefficients)

Observations

  • View Independence vs. View Dependence: Without SH, the rendering appears flat and lacks view-dependent effects. With SH, the rendering shows realistic view-dependent effects like reflections and specular highlights.
  • Visual Quality: The SH rendering captures more realistic material properties, especially for shiny or reflective surfaces.
  • Color Variation: With SH, colors change smoothly as the viewing angle changes, creating a more photorealistic appearance.

Part 2: Diffusion-guided Optimization

2.1 SDS Loss + Image Optimization (20 points)

Implemented the Score Distillation Sampling (SDS) loss with and without classifier-free guidance. The SDS loss uses a pre-trained Stable Diffusion model to guide image optimization toward matching a text prompt.

Implementation Details

# SDS Loss Implementation (from SDS.py)
def sds_loss(self, latents, text_embeddings, text_embeddings_uncond=None,
             guidance_scale=100, grad_scale=1):
    # Sample random timestep
    t = torch.randint(self.min_step, self.max_step + 1, (latents.shape[0],),
                      dtype=torch.long, device=self.device)

    # Add noise to latents
    with torch.no_grad():
        noise = torch.randn_like(latents)
        latents_noisy = self.scheduler.add_noise(latents, noise, t)

        # Predict noise
        noise_pred = self.unet(latents_noisy, t,
                               encoder_hidden_states=text_embeddings).sample

        # Classifier-free guidance
        if text_embeddings_uncond is not None and guidance_scale != 1:
            noise_pred_uncond = self.unet(latents_noisy, t,
                                          encoder_hidden_states=text_embeddings_uncond).sample
            noise_pred = noise_pred_uncond + guidance_scale * (noise_pred - noise_pred_uncond)

    # Compute SDS gradient
    w = 1 - self.alphas[t]
    grad = w[:, None, None, None] * (noise_pred.detach() - noise) * grad_scale

    # Target trick: MSE between current and target latents
    loss = 0.5 * F.mse_loss(latents, (latents - grad).detach(), reduction="sum") / latents.shape[0]
    return loss
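
A hedged sketch of how this loss drives the Q2.1 image optimization, assuming latents of shape (1, 4, 64, 64) are optimized directly with AdamW at learning rate 0.1 (as noted in the observations below); the decode step mirrors the VAE usage shown in the pixel-space variant later:

import torch

# Directly optimize a latent image toward the text prompt.
latents = torch.randn(1, 4, 64, 64, device=sds.device, requires_grad=True)
optimizer = torch.optim.AdamW([latents], lr=0.1)

for itr in range(2000):
    loss = sds.sds_loss(latents,
                        text_embeddings=embeddings['cond'],
                        text_embeddings_uncond=embeddings['uncond'],  # None disables guidance
                        guidance_scale=100)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Decode the optimized latents back to an RGB image.
with torch.no_grad():
    image = sds.vae.decode(latents / sds.vae.config.scaling_factor).sample
    image = (image / 2 + 0.5).clamp(0, 1)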

Results for Four Different Prompts

1. "a hamburger"

Without Guidance
Hamburger without guidance
Iterations: 2000
With Guidance
Hamburger with guidance
Iterations: 2000

2. "a standing corgi dog"

Corgi with guidance

With Guidance - Iterations: 2000

3. "a cat lying on a rooftop"

Cat with guidance

With Guidance - Iterations: 2000

4. "a pigeon on an wire"

Pigeon with guidance

With Guidance - Iterations: 2000

Observations

  • Without Guidance (hamburger only): The image is abstract and lacks detail. The optimization converges faster but produces lower-quality results.
  • With Guidance: Images show significantly better fidelity and detail. Classifier-free guidance (guidance_scale=100) amplifies the conditional signal, resulting in more accurate prompt matching.
  • Training: All images were optimized for 2000 iterations with a learning rate of 0.1 using the AdamW optimizer.

2.2 Texture Map Optimization for Mesh (15 points)

Optimized the texture map of a fixed-geometry cow mesh using SDS loss. The texture is represented by a ColorField neural network that maps vertex coordinates to RGB colors.

Implementation Details

# Mesh Texture Optimization (from Q22_mesh_optimization.py)
# Initialize texture field
color_field = ColorField().to(device)

# Create mesh with learnable texture
mesh = pytorch3d.structures.Meshes(
    verts=vertices,
    faces=faces,
    textures=TexturesVertex(verts_features=color_field(vertices))
)

# Training loop: randomly sample camera and render
for iteration in range(total_iter):
    # Random camera pose
    dist = torch.rand(1).item() * (dist_max - dist_min) + dist_min
    elev = torch.rand(1).item() * (elev_max - elev_min) + elev_min
    azim = torch.rand(1).item() * 360
    R, T = look_at_view_transform(dist=dist, elev=elev, azim=azim)
    cameras = FoVPerspectiveCameras(device=device, R=R, T=T)

    # Render mesh
    rend = renderer(mesh, cameras=cameras, lights=lights)
    rend = rend[0, ..., :3]  # Extract RGB

    # Compute SDS loss
    latents = sds.encode_imgs(rend)
    loss = sds.sds_loss(latents,
                        text_embeddings=embeddings['cond'],
                        text_embeddings_uncond=embeddings['uncond'],
                        guidance_scale=100, grad_scale=1)
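
The ColorField used above is not shown in the snippet; a minimal sketch, assuming a small coordinate MLP with a sigmoid output (layer sizes are illustrative):

import torch
import torch.nn as nn

class ColorField(nn.Module):
    """Maps 3D vertex coordinates to RGB colours in [0, 1]."""
    def __init__(self, hidden_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 3), nn.Sigmoid(),   # keep colours in [0, 1]
        )

    def forward(self, verts: torch.Tensor) -> torch.Tensor:
        # verts: (V, 3) vertex positions -> (V, 3) RGB colour per vertex
        return self.mlp(verts)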

Results for Two Different Text Prompts

1. "a cow with rainbow stripes"

Rainbow striped cow

360° rotation of textured mesh

2. "a dotted black and white cow"

Dotted black and white cow

360° rotation of textured mesh

Observations

  • Texture Quality: The SDS loss successfully applies diverse textures (stripes, dots) to the mesh geometry.
  • View Consistency: The texture appears consistent across different viewing angles, demonstrating successful optimization.
  • Training: Trained for 2000 iterations with random camera sampling for diverse viewpoints.

2.3 NeRF Optimization (15 points)

Optimized a NeRF model to generate 3D objects from text prompts using SDS loss. Both geometry and appearance are learnable, unlike the fixed-geometry mesh in Q2.2.

Implementation Details

# NeRF Optimization (from Q23_nerf_optimization.py)
# Render NeRF from random camera pose
pred_rgb, pred_depth = renderer.render(rays_o, rays_d, staged=False,
                                       perturb=True, bg_color=bg_color)

# Encode to latent space
pred_rgb_512 = F.interpolate(pred_rgb.permute(2, 0, 1).unsqueeze(0),
                             size=(512, 512), mode='bilinear', align_corners=False)
latents = sds.encode_imgs(pred_rgb_512)

# Compute SDS loss
loss = sds.sds_loss(latents,
                    text_embeddings=embeddings['default'],
                    text_embeddings_uncond=embeddings['uncond'],
                    guidance_scale=100, grad_scale=1)

# Regularization losses
loss_entropy = entropy_loss(weights, pred_depth)
loss_orient = orientation_loss(normals, rays_d)
loss_total = loss + lambda_entropy * loss_entropy + lambda_orient * loss_orient
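
The entropy_loss and orientation_loss terms referenced above are not shown; a hedged sketch of common formulations in the DreamFusion / stable-dreamfusion style (the actual implementation may differ in details such as how pred_depth is used):

import torch

def entropy_loss(weights: torch.Tensor, pred_depth=None) -> torch.Tensor:
    """Binary-entropy penalty on ray opacities: pushes density toward empty or solid.
    pred_depth is accepted to mirror the call above but is unused in this sketch."""
    alpha = weights.clamp(1e-5, 1.0 - 1e-5)
    return (-alpha * torch.log2(alpha) - (1.0 - alpha) * torch.log2(1.0 - alpha)).mean()

def orientation_loss(normals: torch.Tensor, rays_d: torch.Tensor) -> torch.Tensor:
    """Penalizes surface normals that face away from the camera, as in DreamFusion."""
    # A positive dot product between a normal and the ray direction means the
    # normal points away from the viewer.
    return ((normals * rays_d).sum(dim=-1).clamp(min=0.0) ** 2).mean()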

Hyperparameters Tuned

  • lambda_entropy: 1e-3 (encourages sparse density, prevents cloudy geometry)
  • lambda_orient: 1e-2 (encourages normals to face camera, improves geometry)
  • latent_iter_ratio: 0.2 (first 20% of training uses normal shading for geometry warmup)

Results for Three Example Prompts

1. "a standing corgi dog"

Corgi RGB

RGB Rendering

Corgi Depth

Depth Map

2. "a drumming gorilla"

Gorilla RGB

RGB Rendering

Gorilla Depth

Depth Map

3. "a rabbit eating a carrot"

Rabbit RGB

RGB Rendering

Rabbit Depth

Depth Map

Observations

  • Geometry Quality: The NeRF successfully learns overall 3D geometry matching the text prompts, with recognizable shapes and structures. However, the geometry is not perfectly consistent across different views (e.g., the corgi has multiple front faces and three ears).
  • Appearance: Colors and textures are reasonably matched to the prompts, though not photorealistic.
  • Depth Maps: Show clear geometric structure and depth variation, confirming successful 3D learning.
  • Training: Trained for 10000 iterations with progressive shading (normal → textureless/lambertian).

2.4.1 View-Dependent Text Embedding (10 points)

Extended Q2.3 with view-dependent text embeddings to improve 3D consistency. Different text embeddings are used based on the camera viewing angle (front, side, back, overhead, bottom).

Implementation Details

# View-Dependent Text Embedding (from Q23_nerf_optimization.py)
if args.view_dep_text:
    azimuth_val = azimuth.item()
    abs_polar = 90 + polar.item()  # absolute polar angle in degrees
    angle_overhead = 30  # degrees

    if abs_polar <= angle_overhead:  # Overhead view
        text_cond = embeddings['overhead']
    elif abs_polar >= (180 - angle_overhead):  # Bottom view
        text_cond = embeddings['bottom']
    else:  # Normal view: interpolate front/side/back based on azimuth
        if azimuth_val >= -90 and azimuth_val < 90:
            if azimuth_val >= 0:
                r = 1 - azimuth_val / 90
            else:
                r = 1 + azimuth_val / 90
            text_cond = r * embeddings['front'] + (1 - r) * embeddings['side']
        else:
            if azimuth_val >= 0:
                r = 1 - (azimuth_val - 90) / 90
            else:
                r = 1 + (azimuth_val + 90) / 90
            text_cond = r * embeddings['side'] + (1 - r) * embeddings['back']
else:
    text_cond = embeddings['default']
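
A hedged sketch of how the per-view embeddings consumed above could be built, assuming the base prompt is simply augmented with a view phrase before text encoding (the helper name encode_text is a placeholder, not the actual SDS.py API):

# Build one text embedding per canonical view by appending a view phrase to the
# base prompt, following DreamFusion's view-dependent conditioning.
view_suffixes = {
    'front':    ', front view',
    'side':     ', side view',
    'back':     ', back view',
    'overhead': ', overhead view',
    'bottom':   ', bottom view',
}
embeddings = {
    'default': encode_text(prompt),   # encode_text stands in for the text encoder call
    'uncond':  encode_text(""),
}
for view, suffix in view_suffixes.items():
    embeddings[view] = encode_text(prompt + suffix)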

Results for Two Example Prompts

1. "a standing corgi dog"

Q2.3 (Baseline)
Corgi RGB baseline

RGB

Corgi depth baseline

Depth

Q2.4.1 (View-Dependent)
Corgi RGB view-dep

RGB

Corgi depth view-dep

Depth

2. "a rabbit eating a carrot"

Q2.3 (Baseline)
Rabbit RGB baseline

RGB

Rabbit depth baseline

Depth

Q2.4.1 (View-Dependent)
Rabbit RGB view-dep

RGB

Rabbit depth view-dep

Depth

Qualitative Analysis: Effects of View-Dependent Text Conditioning

Comparison with Q2.3 (Baseline):

  • 3D Consistency: View-dependent text embeddings help reduce multi-view inconsistencies. The baseline (Q2.3) sometimes shows artifacts like "two mouths" or "three ears" because each view is optimized independently. View-dependent conditioning provides view-specific guidance that encourages consistent geometry.
  • Geometry Quality: The view-dependent approach maintains similar or slightly improved geometric quality. The depth maps show consistent structure across views.
  • Appearance: RGB renderings show similar appearance quality, with view-dependent embeddings providing more context-aware optimization.
  • Limitations: While view-dependent embeddings help, some 3D inconsistencies may still persist due to the fundamental challenge of optimizing 3D structure from 2D diffusion guidance. The effectiveness depends on how well the diffusion model understands view-specific descriptions.

Key Insight: View-dependent text conditioning follows DreamFusion's hierarchical approach (Section 3.2.3), where different text embeddings are used for different viewing angles. This provides more targeted guidance during optimization, reducing the "Janus problem" (multiple front faces) and improving overall 3D consistency.

2.4.3 Variation of Implementation of SDS Loss (10 points)

Implemented SDS loss in pixel space instead of latent space, computing gradients directly in RGB space for potentially better perceptual alignment.

Implementation: Pixel-Space SDS Loss

# Pixel-Space SDS Loss (from SDS.py)
def sds_loss_pixel(self, latents, text_embeddings, text_embeddings_uncond=None,
                   guidance_scale=100, grad_scale=1):
    # Sample timestep and add noise (same as latent-space)
    t = torch.randint(self.min_step, self.max_step + 1, (latents.shape[0],),
                      dtype=torch.long, device=self.device)
    with torch.no_grad():
        noise = torch.randn_like(latents)
        latents_noisy = self.scheduler.add_noise(latents, noise, t)
        noise_pred = self.unet(latents_noisy, t,
                               encoder_hidden_states=text_embeddings).sample

        # Classifier-free guidance
        if text_embeddings_uncond is not None and guidance_scale != 1:
            noise_pred_uncond = self.unet(latents_noisy, t,
                                          encoder_hidden_states=text_embeddings_uncond).sample
            noise_pred = noise_pred_uncond + guidance_scale * (noise_pred - noise_pred_uncond)

    # Compute SDS gradient in latent space
    alpha_t = self.alphas[t]
    w = 1 - alpha_t
    grad_latent = w[:, None, None, None] * (noise_pred.detach() - noise) * grad_scale

    # Decode current latents to pixel space
    latents_scaled = 1 / self.vae.config.scaling_factor * latents
    imgs_orig = self.vae.decode(latents_scaled.type(self.precision_t)).sample
    imgs_orig = (imgs_orig / 2 + 0.5).clamp(0, 1)

    # Decode target latents (latents - grad) to pixel space
    with torch.no_grad():
        latents_target = latents - grad_latent
        latents_target_scaled = 1 / self.vae.config.scaling_factor * latents_target
        imgs_target = self.vae.decode(latents_target_scaled.type(self.precision_t)).sample
        imgs_target = (imgs_target / 2 + 0.5).clamp(0, 1)

    # Compute loss in pixel space using target trick
    loss = 0.5 * F.mse_loss(imgs_orig, imgs_target.detach(), reduction='sum') / latents.shape[0]
    return loss

Key Implementation Details

  • Gradient Computation: SDS gradient is computed in latent space (same noise prediction and guidance as latent-space SDS)
  • Loss Computation: Both current and target latents are decoded to pixel space, and loss is computed in RGB space
  • Gradient Flow: Gradients flow back through the VAE decoder to the NeRF model, providing pixel-space supervision
  • Target Trick: Uses the same target trick formulation but in pixel space: loss = MSE(decode(latents), decode(latents - grad))

Results

Pixel-space SDS result

Result from pixel-space SDS loss: "a standing poodle dog"

Why the Implementation Failed

Problem: The pixel-space SDS loss implementation failed to produce any 3D geometry. The rendered output shows only a colored background with no recognizable 3D shape.

Root Cause Analysis:

  • Gradient Flow Issue: The critical problem is in how gradients flow back through the VAE decoder. In the current implementation, the target image (imgs_target) is computed under torch.no_grad(), which breaks the gradient flow. While this is intentional for the target trick (to prevent gradients from flowing through the target), the issue is that the loss computation may not be properly connected to the NeRF parameters.
  • VAE Decoder Precision: The VAE decoder uses self.precision_t for type conversion, which may cause precision issues when gradients flow back. The decoder might not be fully differentiable in the way needed for this application.
  • Loss Scale: The pixel-space loss may have a different scale compared to latent-space loss, requiring different learning rates or loss scaling that wasn't tuned.
  • Numerical Instability: Decoding latents to pixels and then computing MSE may introduce numerical instabilities, especially when the latents are far from the VAE's training distribution.
  • Missing Regularization: The pixel-space implementation may need additional regularization or different hyperparameters compared to latent-space, but these weren't tuned.

What Should Have Been Done:

  • Ensure Gradient Flow: Make sure that imgs_orig has gradients enabled and flows back through the VAE decoder to the latents, and then to the NeRF model (a small diagnostic sketch follows this list).
  • Tune Hyperparameters: The pixel-space loss likely requires different learning rates, loss scales, or regularization weights compared to latent-space.
  • Stability Checks: Add gradient clipping or normalization to prevent numerical instabilities from the VAE decoder.
  • Progressive Training: Start with latent-space loss to establish geometry, then fine-tune with pixel-space loss.
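
As a concrete check of the first point, a small diagnostic sketch that could confirm whether pixel-space SDS gradients actually reach the NeRF parameters (names follow the snippets above; model stands in for the NeRF module):

# Run one pixel-space SDS step and inspect the gradients on the NeRF parameters.
loss = sds.sds_loss_pixel(latents,
                          text_embeddings=embeddings['default'],
                          text_embeddings_uncond=embeddings['uncond'],
                          guidance_scale=100)
loss.backward()

grads = [p.grad.abs().max().item() for p in model.parameters() if p.grad is not None]
# An empty list or near-zero values would indicate that the VAE decode or the
# precision cast is breaking or attenuating the gradient flow.
print(len(grads), max(grads) if grads else None)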

Comparison with Latent-Space:

The latent-space SDS loss works because it operates directly in the VAE's latent space, which is designed for the diffusion model. The gradients are more stable and the loss scale is well-calibrated. In contrast, pixel-space loss introduces an additional decoding step that may not preserve the gradient information needed for effective optimization.