16-825 Learning for 3D Vision
Task: Implement a 3D Gaussian rasterization pipeline and render pretrained 3D Gaussians.
360° rendering of pre-trained 3D Gaussians (chair scene)
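As a rough illustration of the core compositing step of the rasterizer (the projection of 3D Gaussians to 2D means and covariances is assumed to have happened already; all names below are illustrative, not the exact implementation), a minimal per-pixel sketch of front-to-back alpha blending over depth-sorted Gaussians might look like this:

```python
import torch

def composite_pixel(pixel_xy, means_2d, inv_cov_2d, opacities, colors, depths):
    """Front-to-back alpha compositing of projected Gaussians at a single pixel.

    pixel_xy:   (2,)      pixel coordinate
    means_2d:   (N, 2)    projected Gaussian centers
    inv_cov_2d: (N, 2, 2) inverse 2D covariance of each projected Gaussian
    opacities:  (N,)      per-Gaussian opacity in [0, 1]
    colors:     (N, 3)    per-Gaussian RGB
    depths:     (N,)      camera-space depths used for sorting
    """
    order = torch.argsort(depths)                        # nearest Gaussians first
    d = pixel_xy[None] - means_2d[order]                  # (N, 2) offsets from centers
    # Unnormalized 2D Gaussian density at this pixel (Mahalanobis distance).
    maha = torch.einsum("ni,nij,nj->n", d, inv_cov_2d[order], d)
    alpha = (opacities[order] * torch.exp(-0.5 * maha)).clamp(max=0.99)

    color = torch.zeros(3)
    transmittance = 1.0
    for a, c in zip(alpha, colors[order]):
        color = color + transmittance * a * c             # accumulate weighted color
        transmittance = transmittance * (1.0 - a)         # update remaining transmittance
        if transmittance < 1e-4:                          # early termination
            break
    return color
```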
Task: Train 3D Gaussians on a toy truck scene using multi-view data.
| Parameter | Learning Rate |
|---|---|
| Means | 0.00016 |
| Scales | 0.005 |
| Quaternions (Rotations) | 0.001 |
| Colors | 0.0025 |
| Opacities | 0.05 |
Number of Iterations: 2000
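A minimal sketch of how these per-parameter learning rates can be wired up with Adam parameter groups, assuming the Gaussian attributes are stored as separate `torch.nn.Parameter` tensors (the names and initialization below are illustrative):

```python
import torch

N = 10_000  # number of Gaussians (placeholder)

# Hypothetical per-attribute parameters; in practice these come from initialization.
means       = torch.nn.Parameter(torch.zeros(N, 3))
scales      = torch.nn.Parameter(torch.zeros(N, 3))
quaternions = torch.nn.Parameter(torch.zeros(N, 4))
colors      = torch.nn.Parameter(torch.zeros(N, 3))
opacities   = torch.nn.Parameter(torch.zeros(N, 1))

# One Adam parameter group per attribute, matching the table above.
optimizer = torch.optim.Adam([
    {"params": [means],       "lr": 0.00016},
    {"params": [scales],      "lr": 0.005},
    {"params": [quaternions], "lr": 0.001},
    {"params": [colors],      "lr": 0.0025},
    {"params": [opacities],   "lr": 0.05},
])
```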
PSNR: 26.84
SSIM: 0.912
Training Progress (Top: Predicted, Bottom: Ground Truth)
Final 360° Rendering
Task: Add support for rendering with spherical harmonics to capture view dependent effects.
I extended the rasterizer to support spherical harmonics (SH) for view dependent color modeling: per-Gaussian colors are computed from all SH bands (not just the DC term) based on the viewing direction.
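As a minimal sketch of how the lowest-order evaluation looks (degree-0 plus degree-1 bands only; the full implementation also uses the higher-order bands), assuming `sh` holds per-Gaussian coefficients and `dirs` holds unit viewing directions from the camera toward each Gaussian:

```python
import torch

SH_C0 = 0.28209479177387814  # degree-0 (DC) SH constant
SH_C1 = 0.4886025119029199   # degree-1 SH constant

def eval_sh_deg1(sh, dirs):
    """sh: (N, 4, 3) SH coefficients per Gaussian, dirs: (N, 3) unit view directions."""
    x, y, z = dirs[:, 0:1], dirs[:, 1:2], dirs[:, 2:3]
    color = (SH_C0 * sh[:, 0]
             - SH_C1 * y * sh[:, 1]
             + SH_C1 * z * sh[:, 2]
             - SH_C1 * x * sh[:, 3])
    # Shift from SH space to RGB and clamp, as in the 3DGS reference implementation.
    return (color + 0.5).clamp(min=0.0)
```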
Rendering with full spherical harmonics support
The highlights and reflections on the chair's seat cushion differ noticeably in the full spherical harmonics view.
The difference is most apparent on the chair's back cushion, where the shadow changes considerably; the front of the seat cushion also shows different shading, and the reflectiveness of the gold portions differs between the two versions.
The two versions are much more similar from this view, but the gold reflections on the back of the chair are slightly brighter in the full spherical harmonics view.
Summary: Across all views, the full spherical harmonics version consistently shows improved visual quality with better handling of highlights, shadows, and view dependent reflections. The view dependent shading makes the scene appear more realistic and three dimensional.
Task: Train on a more complex scene (materials dataset) with random initialization and improve performance.
Using the same setup as Q1.2 (isotropic Gaussians, same learning rates, L1 loss):
Baseline Training Progress
Baseline Final Rendering
Baseline PSNR: 18.52
Baseline SSIM: 0.723
Techniques Used:
- Longer training schedule (4000 iterations vs. 2000 for the baseline)
- A gentler learning rate decay schedule
- An increased weight on the SSIM term of the training loss (see the sketch below)
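Since the improved run adds an SSIM term on top of the L1 loss, here is a minimal sketch of a weighted L1 + SSIM objective, assuming the differentiable SSIM from the `pytorch_msssim` package; the `lambda_ssim` value below is a placeholder, not the exact weight used:

```python
import torch
from pytorch_msssim import ssim  # differentiable SSIM

def combined_loss(pred, gt, lambda_ssim=0.2):
    """pred, gt: (B, 3, H, W) images in [0, 1]; lambda_ssim is a placeholder weight."""
    l1 = torch.abs(pred - gt).mean()
    d_ssim = 1.0 - ssim(pred, gt, data_range=1.0)
    return (1.0 - lambda_ssim) * l1 + lambda_ssim * d_ssim
```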
Improved Training Progress (4000 iterations)
Improved Final Rendering
Improved PSNR: 18.811 (+0.291)
Improved SSIM: 0.712 (-0.011)
Summary: The improved approach successfully increased PSNR by 0.291 while maintaining comparable SSIM (only 0.011 decrease). The gentler learning rate decay and increased SSIM loss weight helped the model converge to a better solution with improved visual quality and structural detail.
Task: Implement Score Distillation Sampling (SDS) loss and optimize images from text prompts.
Implemented SDS loss following the DreamFusion paper, including both guided and unguided variants. The SDS loss uses a pretrained Stable Diffusion model to provide gradients that optimize latent representations toward matching text prompts.
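A minimal sketch of the SDS update in latent space, assuming a diffusers-style Stable Diffusion setup (`unet`, `scheduler`) and `text_embeddings` holding the unconditional and conditional prompt embeddings concatenated along the batch dimension; the timestep range and weighting are illustrative:

```python
import torch
import torch.nn.functional as F

def sds_loss(latents, unet, scheduler, text_embeddings, guidance_scale=100.0):
    """SDS on Stable Diffusion latents of shape (B, 4, 64, 64).

    text_embeddings: (2B, 77, D) = [unconditional; conditional] embeddings.
    """
    # Sample a random timestep and noise the latents.
    t = torch.randint(20, 980, (latents.shape[0],), device=latents.device)
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, t)

    # Predict noise with and without text conditioning (classifier-free guidance).
    with torch.no_grad():
        noise_pred = unet(torch.cat([noisy] * 2), torch.cat([t] * 2),
                          encoder_hidden_states=text_embeddings).sample
        noise_uncond, noise_cond = noise_pred.chunk(2)
        noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)

    # SDS gradient w(t) * (eps_pred - eps), folded into an MSE-style surrogate loss
    # so that loss.backward() delivers exactly this gradient to the latents.
    w = (1.0 - scheduler.alphas_cumprod.to(latents.device)[t]).view(-1, 1, 1, 1)
    grad = w * (noise_pred - noise)
    target = (latents - grad).detach()
    return 0.5 * F.mse_loss(latents, target, reduction="sum")
```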
Without Guidance
Iterations: 1000
With Guidance
Iterations: 1000
Without Guidance
Iterations: 1000
With Guidance
Iterations: 1000
Without Guidance
Iterations: 1000
With Guidance
Iterations: 1000
Without Guidance
Iterations: 1000
With Guidance
Iterations: 1000
Observations: Classifier-free guidance significantly improves image quality and fidelity to the prompt. The guided results show clearer details, better colors, and more recognizable objects, while the unguided results tend to be more abstract and less detailed.
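For reference, the guided prediction amounts to extrapolating the conditional noise estimate away from the unconditional one; a small sketch (DreamFusion recommends a large guidance scale, around 100):

```python
def guided_noise(noise_uncond, noise_cond, guidance_scale=100.0):
    # guidance_scale = 1 recovers the plain conditional prediction;
    # larger values (DreamFusion uses ~100) push the result toward the prompt.
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```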
Task: Optimize the texture of a mesh (cow.obj) using SDS loss with different text prompts.
Using random camera sampling during training, the texture field (implemented as a ColorField MLP) is optimized to make the cow mesh match different text descriptions. Each view is rendered and optimized using SDS loss for 1000 iterations.
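A rough sketch of one possible optimization loop, reusing the `sds_loss` sketch from above; `render_mesh`, `sample_random_camera`, and `ColorField` stand in for the actual renderer and texture MLP and are assumptions, not the exact implementation:

```python
import torch

def optimize_texture(cow_mesh, color_field, render_mesh, sample_random_camera,
                     vae, unet, scheduler, text_embeddings, n_iters=1000):
    """Hypothetical texture-optimization loop; the renderer helpers are assumptions."""
    optimizer = torch.optim.Adam(color_field.parameters(), lr=1e-3)
    for step in range(n_iters):
        cameras = sample_random_camera()                     # random viewpoint each step
        image = render_mesh(cow_mesh, color_field, cameras)  # (1, 3, H, W) in [0, 1]

        # Encode the rendering into Stable Diffusion latents and apply SDS.
        latents = vae.encode(image * 2.0 - 1.0).latent_dist.sample() * 0.18215
        loss = sds_loss(latents, unet, scheduler, text_embeddings)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```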
Prompt: "a golden bull"
Final textured mesh with golden metallic appearance
Prompt: "a zebra"
Final textured mesh with zebra stripes pattern
The SDS loss successfully optimizes mesh textures to match text descriptions, transforming the cow mesh into different animals and materials. The golden bull shows material transformation while the zebra demonstrates the ability to change both appearance and perceived animal type.
Task: Optimize a NeRF model using SDS loss to create 3D scenes from text prompts.
| Parameter | Value |
|---|---|
| lambda_entropy | 0.01 |
| lambda_orient | 0.01 |
| latent_iter_ratio | 0.2 |
| Iterations | 10000 (100 epochs) |
The latent_iter_ratio controls when the model switches from normal shading (warmup phase) to random shading (lambertian/textureless). Entropy and orientation regularization help maintain better geometry and reduce artifacts.
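A small sketch of how this switch might look in the training loop, following the description above (the latent_iter_ratio value comes from the table; everything else is illustrative):

```python
import random

def pick_shading(step, total_steps, latent_iter_ratio=0.2):
    # Warmup phase: normal shading to stabilize the geometry.
    if step < latent_iter_ratio * total_steps:
        return "normal"
    # Afterwards: randomly alternate shading modes each iteration.
    return random.choice(["lambertian", "textureless"])
```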
RGB Rendering
Depth Map
RGB Rendering
Depth Map
RGB Rendering
Depth Map
The NeRF model successfully learns 3D geometry and appearance from text prompts alone, creating recognizable objects with reasonable depth structure. The "a tennis player serving the ball" prompt did not work out as well; I believe the scene is simply too complex for the NeRF optimization to converge within 100 epochs (10,000 iterations).
Task: Use view dependent text embeddings to improve 3D consistency in NeRF optimization.
Implemented view dependent conditioning in which the text embedding is chosen based on the camera azimuth, augmenting the prompt with a directional phrase (e.g., front, side, or back view).
This helps the model generate more consistent 3D geometry by providing directional context.
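A minimal sketch of the azimuth-based prompt augmentation, assuming azimuth is measured in degrees with 0° facing the front (the exact thresholds are illustrative):

```python
def directional_prompt(prompt, azimuth_deg):
    """Append a view phrase to the prompt based on camera azimuth (0 deg = front)."""
    azimuth_deg = azimuth_deg % 360
    if azimuth_deg < 60 or azimuth_deg > 300:
        suffix = "front view"
    elif 120 <= azimuth_deg <= 240:
        suffix = "back view"
    else:
        suffix = "side view"
    return f"{prompt}, {suffix}"

# e.g. directional_prompt("a hamburger", 180.0) -> "a hamburger, back view"
```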
RGB with View Dependent Text
Depth with View Dependent Text
RGB with View Dependent Text
Depth with View Dependent Text
Comparison to Q2.3 Baseline: The view dependent conditioning produces more 3D consistent results with fewer multi-face artifacts. Objects maintain better structural coherence across different viewing angles.
Task: Implement SDS loss in pixel space using LPIPS perceptual loss instead of latent space L2 loss.
Implementation Details:
The pixel space SDS implementation follows this pipeline: the rendered image is encoded into latents with the VAE, noised at a random timestep, and denoised by the diffusion UNet; the denoised latents are then decoded back to pixel space with the VAE, and an LPIPS loss between the rendering and the decoded image provides the gradient.
This differs from standard latent space SDS by computing the loss in pixel space with a learned perceptual metric (LPIPS with VGG features) instead of a simple L2 loss in latent space.
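A rough sketch of this pixel space variant, assuming the same diffusers-style `vae`, `unet`, and `scheduler` as before and the `lpips` package with a VGG backbone; the one-step denoised estimate and timestep range are illustrative:

```python
import torch
import lpips

lpips_vgg = lpips.LPIPS(net="vgg").cuda()  # perceptual distance in pixel space

def pixel_sds_loss(image, vae, unet, scheduler, text_embeddings, guidance_scale=100.0):
    """image: (B, 3, H, W) rendered pixels in [0, 1]."""
    # 1. Encode the rendering into Stable Diffusion's latent space.
    latents = vae.encode(image * 2.0 - 1.0).latent_dist.sample() * 0.18215

    # 2. Noise the latents and predict the noise with classifier-free guidance.
    t = torch.randint(20, 980, (latents.shape[0],), device=latents.device)
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, t)
    with torch.no_grad():
        noise_pred = unet(torch.cat([noisy] * 2), torch.cat([t] * 2),
                          encoder_hidden_states=text_embeddings).sample
        uncond, cond = noise_pred.chunk(2)
        noise_pred = uncond + guidance_scale * (cond - uncond)

        # 3. One-step estimate of the clean latents, decoded back to pixels.
        alpha_bar = scheduler.alphas_cumprod.to(latents.device)[t].view(-1, 1, 1, 1)
        latents_denoised = (noisy - (1 - alpha_bar).sqrt() * noise_pred) / alpha_bar.sqrt()
        target = (vae.decode(latents_denoised / 0.18215).sample * 0.5 + 0.5).clamp(0, 1)

    # 4. LPIPS between the rendering and the detached decoded target drives the update.
    return lpips_vgg(image * 2.0 - 1.0, target * 2.0 - 1.0).mean()
```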
Training Configuration: Trained for 100 epochs (10,000 iterations total) with the same hyperparameters as Q2.3 (lambda_entropy: 0.01, lambda_orient: 0.01, latent_iter_ratio: 0.2).
RGB with Pixel Space SDS
Depth with Pixel Space SDS
Analysis:
Image Quality: The pixel space SDS results are noticeably worse than the latent space baseline (Q2.3). The generated geometry is less defined and the overall 3D structure is weaker.
Why the results are worse: While theoretically appealing, pixel space SDS with LPIPS has several disadvantages:
1. The additional VAE decode step introduces noise and errors into the gradient flow.
2. The LPIPS perceptual loss may not provide gradients as well suited to 3D consistency as the simple L2 loss in latent space.
3. Pixel space gradients are noisier and less stable than latent space gradients, making optimization more difficult.
4. The diffusion model was trained in latent space, so computing the loss in pixel space moves away from the model's natural operating domain.
Training Time: Slightly slower than latent space SDS due to the additional decode step and LPIPS computation.
Conclusion: Despite the intuitive appeal of using perceptual loss in pixel space, the standard latent space SDS approach produces superior results. The latent space representation appears to be better suited for providing clean, stable gradients for 3D generation tasks.