Assignment 4: 3D Gaussian Splatting & Score Distillation Sampling

16-825 Learning for 3D Vision

Question 1.1.5: Rendering Pre-trained 3D Gaussians

Task: Implement a 3D Gaussian rasterization pipeline and render pretrained 3D Gaussians.
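
As a reference for the core of the pipeline, here is a minimal sketch of the splat-and-composite step (the function and variable names are mine, not the starter code's). It assumes the 3D means and covariances have already been projected into screen space via the EWA Jacobian (Σ' = J W Σ Wᵀ Jᵀ) and composites the sorted Gaussians front to back:

    import torch

    def composite_gaussians(means2d, covs2d, colors, opacities, depths, H, W):
        """Alpha-composite already-projected 2D Gaussians into an H-by-W image.

        means2d:   (N, 2) screen-space centers
        covs2d:    (N, 2, 2) projected 2D covariances
        colors:    (N, 3) per-Gaussian RGB
        opacities: (N,) per-Gaussian opacity in [0, 1]
        depths:    (N,) camera-space depths used for front-to-back sorting
        """
        order = torch.argsort(depths)                     # nearer Gaussians occlude farther ones
        means2d, covs2d = means2d[order], covs2d[order]
        colors, opacities = colors[order], opacities[order]

        ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
        pix = torch.stack([xs, ys], dim=-1).reshape(-1, 2).float()

        image = torch.zeros(H * W, 3)
        transmittance = torch.ones(H * W, 1)
        inv_covs = torch.linalg.inv(covs2d)

        for i in range(means2d.shape[0]):                 # a real rasterizer tiles and batches this loop
            d = pix - means2d[i]
            maha = (d @ inv_covs[i] * d).sum(dim=-1)      # squared Mahalanobis distance
            alpha = (opacities[i] * torch.exp(-0.5 * maha)).clamp(max=0.99).unsqueeze(-1)
            image = image + transmittance * alpha * colors[i]
            transmittance = transmittance * (1.0 - alpha)

        return image.reshape(H, W, 3)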

Pre-trained Gaussian Rendering

360° rendering of pre-trained 3D Gaussians (chair scene)

Question 1.2: Training 3D Gaussian Representations

Task: Train 3D Gaussians on a toy truck scene using multi-view data.

Training Parameters

Parameter               | Learning Rate
Means                   | 0.00016
Scales                  | 0.005
Quaternions (Rotations) | 0.001
Colors                  | 0.0025
Opacities               | 0.05

Number of Iterations: 2000
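
Below is a minimal sketch of how these per-attribute learning rates can be set up with Adam parameter groups; the tensor shapes and names are illustrative, not the starter code's:

    import torch

    N = 100_000                                      # number of Gaussians (placeholder)
    means       = torch.zeros(N, 3, requires_grad=True)
    scales      = torch.zeros(N, 3, requires_grad=True)
    quaternions = torch.zeros(N, 4, requires_grad=True)
    colors      = torch.zeros(N, 3, requires_grad=True)
    opacities   = torch.zeros(N, 1, requires_grad=True)

    # One learning rate per attribute, matching the table above.
    optimizer = torch.optim.Adam([
        {"params": [means],       "lr": 0.00016},
        {"params": [scales],      "lr": 0.005},
        {"params": [quaternions], "lr": 0.001},
        {"params": [colors],      "lr": 0.0025},
        {"params": [opacities],   "lr": 0.05},
    ])

Each iteration renders a training view with the current Gaussians and minimizes an L1 loss against the ground-truth image, for the 2000 iterations reported above.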

PSNR: 26.84

SSIM: 0.912

Training Progress

Training Progress (Top: Predicted, Bottom: Ground Truth)

Final Renders

Final 360° Rendering

Question 1.3.1: Spherical Harmonics Extension (10 points)

Task: Add support for rendering with spherical harmonics to capture view-dependent effects.

I extended the rasterizer to support spherical harmonics (SH) for view-dependent color modeling. The implementation computes colors from all SH components (not just the DC term) based on the viewing direction.
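
For reference, here is a minimal sketch of the SH color evaluation (only degree 1 is shown for brevity; the full implementation evaluates all bands). The constants are the standard real SH coefficients; the function and argument names are illustrative:

    import torch

    C0 = 0.28209479177387814        # real SH constant, degree 0
    C1 = 0.4886025119029199         # real SH constant, degree 1

    def sh_to_rgb(sh_coeffs, view_dirs):
        """Evaluate view-dependent color from SH coefficients (degree 1 shown).

        sh_coeffs: (N, 4, 3) per-Gaussian coefficients (DC + three degree-1 terms)
        view_dirs: (N, 3) unit vectors from the camera to each Gaussian center
        """
        x, y, z = view_dirs[:, 0:1], view_dirs[:, 1:2], view_dirs[:, 2:3]
        rgb = C0 * sh_coeffs[:, 0]                 # DC term (view independent)
        rgb = rgb - C1 * y * sh_coeffs[:, 1]       # degree-1 terms (view dependent)
        rgb = rgb + C1 * z * sh_coeffs[:, 2]
        rgb = rgb - C1 * x * sh_coeffs[:, 3]
        return (rgb + 0.5).clamp(0.0, 1.0)         # offset-and-clamp convention used by 3DGS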

Spherical Harmonics Rendering

Rendering with full spherical harmonics support

Comparison: DC only vs Full SH

View 1

DC Component Only

DC Component Only (View Independent)

Full Spherical Harmonics

Full Spherical Harmonics (View Dependent)

The highlights and reflections on the chair's cushion are visibly different in the full spherical harmonics rendering.

View 2

DC Component Only - View 2

DC Component Only

Full SH - View 2

Full Spherical Harmonics

The difference is most apparent on the chair's back cushion, where the shadowing changes noticeably; the very front of the cushion also shows different shadow characteristics. The reflectivity of the gold portions likewise differs considerably between the two versions.

View 3

DC Component Only - View 3

DC Component Only

Full SH - View 3

Full Spherical Harmonics

This view is much more similar between the two versions, but the gold reflections on the back of the chair are slightly brighter in the full spherical harmonics rendering.

Summary: Across all views, the full spherical harmonics version consistently shows improved visual quality, with better handling of highlights, shadows, and view-dependent reflections. The view-dependent shading makes the scene appear more realistic and three-dimensional.

Question 1.3.2: Training on Harder Scene (Extra Credit)

Task: Train on a more complex scene (materials dataset) with random initialization and improve performance.

Baseline Approach

Using the same setup as Q1.2 (isotropic Gaussians, same learning rates, L1 loss):

Baseline Training Progress

Baseline Training Progress

Baseline Final

Baseline Final Rendering

Baseline PSNR: 18.52

Baseline SSIM: 0.723

Improved Approach

Techniques Used:

  • Anisotropic Gaussians: Switched from isotropic to anisotropic Gaussians to better model complex geometry
  • Gentler Learning Rate Decay: Used StepLR scheduler with gamma=0.8 (instead of 0.5) for more gradual learning rate reduction every 1000 iterations
  • Increased SSIM Loss Weight: Adjusted the loss to 0.4 * SSIM + 0.6 * MSE for better perceptual quality while maintaining structural fidelity (see the sketch after this list)
  • Extended Training: Trained for 4000 iterations for better convergence
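
A sketch of the modified loss and scheduler follows, assuming the pytorch_msssim package for SSIM and treating the render function and training-view loader as placeholders; I also assume SSIM enters the loss as (1 - SSIM), since it is a similarity to be maximized:

    import torch
    import torch.nn.functional as F
    from torch.optim.lr_scheduler import StepLR
    from pytorch_msssim import ssim                # assumed SSIM implementation

    def train_improved(optimizer, render_fn, sample_view, n_iters=4000):
        """Improved schedule and loss; render_fn and sample_view stand in for the
        assignment's rasterizer and training-view loader."""
        scheduler = StepLR(optimizer, step_size=1000, gamma=0.8)   # gentler decay than 0.5
        for _ in range(n_iters):
            camera, gt = sample_view()             # random training view, (1, 3, H, W) in [0, 1]
            pred = render_fn(camera)
            mse = F.mse_loss(pred, gt)
            loss = 0.4 * (1.0 - ssim(pred, gt, data_range=1.0)) + 0.6 * mse
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()
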
Improved Training Progress

Improved Training Progress (4000 iterations)

Improved Final

Improved Final Rendering

Improved PSNR: 18.811 (+0.291)

Improved SSIM: 0.712 (-0.011)

Summary: The improved approach increased PSNR by 0.291 while keeping SSIM essentially comparable (a 0.011 decrease). The gentler learning rate decay and the larger SSIM weight helped the optimization converge to a visibly better solution on this harder scene.

Question 2.1: SDS Loss + Image Optimization

Task: Implement Score Distillation Sampling (SDS) loss and optimize images from text prompts.

Implemented SDS loss following the DreamFusion paper, including both guided and unguided variants. The SDS loss uses a pretrained Stable Diffusion model to provide gradients that optimize latent representations toward matching text prompts.
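
Below is a minimal sketch of the loss, written against diffusers-style UNet/scheduler interfaces; the timestep range, the weighting w(t) = 1 - ᾱₜ, and the guidance scale are assumptions in line with common DreamFusion implementations rather than my exact settings:

    import torch
    import torch.nn.functional as F

    def sds_loss(latents, text_emb, uncond_emb, unet, scheduler, guidance_scale=100.0):
        """One SDS step: perturb the latents, predict the noise, and build a surrogate
        loss whose gradient w.r.t. the latents is w(t) * (eps_pred - eps)."""
        t = torch.randint(20, 980, (1,), device=latents.device)
        noise = torch.randn_like(latents)
        noisy = scheduler.add_noise(latents, noise, t)

        with torch.no_grad():
            # Classifier-free guidance: one unconditional and one text-conditioned pass.
            eps = unet(torch.cat([noisy, noisy]), t,
                       encoder_hidden_states=torch.cat([uncond_emb, text_emb])).sample
            eps_uncond, eps_text = eps.chunk(2)
            eps_pred = eps_uncond + guidance_scale * (eps_text - eps_uncond)

        alphas = scheduler.alphas_cumprod.to(latents.device)
        grad = (1.0 - alphas[t]) * (eps_pred - noise)       # w(t) * (eps_pred - eps)
        # Surrogate objective: its gradient w.r.t. the latents equals grad, with no
        # backpropagation through the U-Net.
        target = (latents - grad).detach()
        return 0.5 * F.mse_loss(latents, target, reduction="sum")

For the unguided variant, eps_pred is simply the text-conditioned prediction, without the classifier-free-guidance extrapolation.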

Results for 4 Prompts (With and Without Guidance)

Prompt: "a hamburger"

Hamburger Without Guidance

Without Guidance
Iterations: 1000

Hamburger With Guidance

With Guidance
Iterations: 1000

Prompt: "a standing corgi dog"

Corgi Without Guidance

Without Guidance
Iterations: 1000

Corgi With Guidance

With Guidance
Iterations: 1000

Prompt: "a basketball"

Basketball Without Guidance

Without Guidance
Iterations: 1000

Basketball With Guidance

With Guidance
Iterations: 1000

Prompt: "a tennis player serving the ball"

Tennis Player Without Guidance

Without Guidance
Iterations: 1000

Tennis Player With Guidance

With Guidance
Iterations: 1000

Observations: Classifier-free guidance significantly improves image quality and fidelity to the prompt. The guided versions show clearer details, better colors, and more recognizable objects; without guidance, the images tend to be more abstract and less detailed.

Question 2.2: Texture Map Optimization for Mesh

Task: Optimize the texture of a mesh (cow.obj) using SDS loss with different text prompts.

The texture field (implemented as a ColorField MLP) is optimized with SDS loss so that the cow mesh matches different text descriptions. At each of 1000 iterations, a random camera is sampled and the resulting rendering of the mesh is optimized against the prompt.
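
The optimization loop reduces to the sketch below, where color_field, render_mesh, sample_camera, and sds_loss are placeholders for the assignment's texture MLP, differentiable mesh renderer, camera sampler, and SDS loss; the learning rate is likewise an assumption:

    import torch

    def optimize_texture(color_field, mesh, sample_camera, render_mesh, sds_loss,
                         text_embeddings, n_iters=1000, lr=1e-3):
        """Fit a texture MLP to a text prompt by rendering random views and applying SDS."""
        optimizer = torch.optim.Adam(color_field.parameters(), lr=lr)
        for _ in range(n_iters):
            camera = sample_camera()                        # random azimuth/elevation each step
            rgb = render_mesh(mesh, color_field, camera)    # differentiable render of the cow mesh
            loss = sds_loss(rgb, text_embeddings)           # SDS loss against the prompt
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()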

Golden Bull Mesh

Prompt: "a golden bull"
Final textured mesh with golden metallic appearance

Zebra Mesh

Prompt: "a zebra"
Final textured mesh with zebra stripes pattern

The SDS loss successfully optimizes mesh textures to match text descriptions, transforming the cow mesh into different animals and materials. The golden bull shows material transformation while the zebra demonstrates the ability to change both appearance and perceived animal type.

Question 2.3: NeRF Optimization

Task: Optimize a NeRF model using SDS loss to create 3D scenes from text prompts.

Hyperparameters

Parameter         | Value
lambda_entropy    | 0.01
lambda_orient     | 0.01
latent_iter_ratio | 0.2
Iterations        | 10000 (100 epochs)

The latent_iter_ratio controls when the model switches from normal shading (the warmup phase) to randomly sampled shading (Lambertian/textureless). Entropy and orientation regularization help maintain cleaner geometry and reduce artifacts.
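
A sketch of the two regularizers, written in the DreamFusion / stable-dreamfusion style (I am assuming the starter code defines them similarly; the argument names are illustrative):

    import torch

    def regularization_losses(weights, normals, view_dirs,
                              lambda_entropy=0.01, lambda_orient=0.01):
        """weights:   (R, S) volume-rendering weights per ray sample
           normals:   (R, S, 3) predicted surface normals
           view_dirs: (R, S, 3) ray directions"""
        # Entropy regularization: push per-sample opacities toward 0 or 1.
        a = weights.clamp(1e-5, 1.0 - 1e-5)
        entropy = (-a * torch.log2(a) - (1.0 - a) * torch.log2(1.0 - a)).mean()

        # Orientation regularization: penalize normals that face away from the camera.
        orient = (weights.detach()
                  * (normals * view_dirs).sum(-1).clamp(min=0.0) ** 2).mean()

        return lambda_entropy * entropy + lambda_orient * orient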

Results for 3 Prompts

Prompt: "a standing corgi dog"

RGB Rendering

Depth Map

Prompt: "a basketball"

RGB Rendering

Depth Map

Prompt: "a tennis player serving the ball"

RGB Rendering

Depth Map

The NeRF model successfully learns 3D geometry and appearance from text prompts alone, creating recognizable objects with reasonable depth structure. The "a tennis player serving the ball" prompt did not work out as well; I believe the scene is simply too complex for the NeRF optimization to converge within 100 epochs (10,000 iterations).

Question 2.4.1: View-Dependent Text Embedding (10 points)

Task: Use view dependent text embeddings to improve 3D consistency in NeRF optimization.

Implemented view-dependent conditioning, where different text embeddings are selected based on the camera azimuth:

  • Front view (-60° to 60°): "front view of [prompt]"
  • Side view (60° to 120°, -120° to -60°): "side view of [prompt]"
  • Back view (120° to 180°, -180° to -120°): "back view of [prompt]"

This helps the model generate more consistent 3D geometry by providing directional context.
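
The azimuth-to-prompt mapping itself is simple; a sketch is below, assuming the three text embeddings have been precomputed once with the text encoder (the dictionary keys are illustrative):

    def view_dependent_embedding(azimuth_deg, embeddings):
        """Select a text embedding based on the camera azimuth, in degrees in [-180, 180]."""
        if -60.0 <= azimuth_deg <= 60.0:
            return embeddings["front"]        # "front view of [prompt]"
        elif 60.0 < azimuth_deg <= 120.0 or -120.0 <= azimuth_deg < -60.0:
            return embeddings["side"]         # "side view of [prompt]"
        else:
            return embeddings["back"]         # "back view of [prompt]"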

Prompt: "a standing corgi dog" (with view dependent text)

RGB with View Dependent Text

Depth with View Dependent Text

Prompt: "a basketball" (with view dependent text)

RGB with View Dependent Text

Depth with View Dependent Text

Comparison to Q2.3 Baseline: The view-dependent conditioning produces more 3D-consistent results with fewer multi-face artifacts. Objects maintain better structural coherence across different viewing angles.

Question 2.4.3: Pixel-Space SDS with LPIPS (Extra Credit)

Task: Implement SDS loss in pixel space using LPIPS perceptual loss instead of the latent-space L2 loss.

Implementation Details:

The pixel-space SDS implementation follows this pipeline:

  1. Encode RGB image to latents (required for U-Net compatibility)
  2. Add noise to latents at timestep t
  3. Predict noise using U-Net with text conditioning
  4. Denoise using the DDIM formula to get the predicted clean latents: x₀ = (xₜ - σₜ * εₚᵣₑᵈ) / √αₜ, where σₜ = √(1 - αₜ)
  5. Decode predicted latents back to RGB space
  6. Compute LPIPS perceptual loss between original and predicted RGB
  7. Apply timestep weighting and backpropagate

This differs from standard latent-space SDS by computing gradients in pixel space using a learned perceptual metric (LPIPS with VGG features) instead of a simple L2 loss on the latents.
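
A sketch of that pipeline, using diffusers-style VAE/UNet/scheduler interfaces and the lpips package; the 0.18215 latent scaling factor, timestep range, and guidance scale are assumptions consistent with Stable Diffusion 1.x rather than the exact starter-code values:

    import torch
    import lpips

    lpips_fn = lpips.LPIPS(net="vgg")          # perceptual distance on VGG features

    def pixel_space_sds(rgb, vae, unet, scheduler, text_emb, uncond_emb,
                        guidance_scale=100.0):
        """rgb: (1, 3, H, W) render in [0, 1]; returns a pixel-space SDS-style loss."""
        latents = vae.encode(rgb * 2.0 - 1.0).latent_dist.sample() * 0.18215
        t = torch.randint(20, 980, (1,), device=latents.device)
        noise = torch.randn_like(latents)
        noisy = scheduler.add_noise(latents, noise, t)

        with torch.no_grad():
            eps = unet(torch.cat([noisy, noisy]), t,
                       encoder_hidden_states=torch.cat([uncond_emb, text_emb])).sample
            eps_uncond, eps_text = eps.chunk(2)
            eps_pred = eps_uncond + guidance_scale * (eps_text - eps_uncond)

            # One-step prediction of the clean latents (step 4), then decode to pixels (step 5).
            alpha = scheduler.alphas_cumprod.to(latents.device)[t]
            x0_latents = (noisy - (1.0 - alpha).sqrt() * eps_pred) / alpha.sqrt()
            x0_rgb = (vae.decode(x0_latents / 0.18215).sample * 0.5 + 0.5).clamp(0.0, 1.0)

        # LPIPS expects inputs in [-1, 1]; weight by (1 - alpha) as in latent-space SDS.
        return (1.0 - alpha) * lpips_fn(rgb * 2.0 - 1.0, x0_rgb * 2.0 - 1.0).mean()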

Training Configuration: Trained for 100 epochs (10,000 iterations total) with the same hyperparameters as Q2.3 (lambda_entropy: 0.01, lambda_orient: 0.01, latent_iter_ratio: 0.2).

Prompt: "a standing corgi dog" (with pixel space SDS)

RGB with Pixel Space SDS

Depth with Pixel Space SDS

Analysis:

Image Quality: The pixel-space SDS results are noticeably worse than the latent-space baseline (Q2.3). The generated geometry is less defined, and the overall 3D structure is weaker.

Why the results are worse: While theoretically appealing, pixel-space SDS with LPIPS has several disadvantages:

  • The additional VAE decode step introduces noise and errors into the gradient flow.
  • The LPIPS perceptual loss may not provide gradients as well suited to 3D consistency as the simple L2 loss in latent space.
  • Pixel-space gradients are noisier and less stable than latent-space gradients, making optimization more difficult.
  • The diffusion model was trained in latent space, so computing the loss in pixel space moves away from the model's natural operating domain.

Training Time: Slightly slower than latent-space SDS due to the additional VAE decode step and the LPIPS computation.

Conclusion: Despite the intuitive appeal of using a perceptual loss in pixel space, the standard latent-space SDS approach produces superior results. The latent representation appears to be better suited for providing clean, stable gradients for 3D generation tasks.