Assignment 4 - 3D Gaussian Splatting and Diffusion-guided Optimization
| Course | 16-825 (Learning for 3D Vision) |
|---|---|
| Assignment | 4 |
| Student | Kunwoo Lee |
This page summarizes my implementation and results for Gaussian splatting (Q1) and diffusion-guided optimization (Q2).
Part 1: 3D Gaussian Splatting
1.1 Perform Splatting
I implemented projection, filtering, alpha and transmittance computation, and splatting in
model.py, and used them in render.py to render the pre-trained Gaussians.
Files: q1_output/q1_render.gif, q1_output/q1_render_115.gif
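For reference, the core of the splatting step is front-to-back alpha compositing of the depth-sorted Gaussians. Below is a minimal sketch of that compositing logic; the function name and tensor layout are illustrative, not the actual interface in model.py.

```python
import torch

def alpha_composite(alphas, colors):
    """
    Front-to-back alpha compositing of depth-sorted Gaussians (sketch).

    alphas: (N, P) per-Gaussian, per-pixel opacity, i.e.
            opacity_i * exp(-0.5 * d^T Sigma_2D^{-1} d) after projection/filtering
    colors: (N, 3) per-Gaussian RGB
    returns (P, 3) composited pixel colors
    """
    # Exclusive cumulative product gives the transmittance T_i = prod_{j<i} (1 - alpha_j).
    one_minus = 1.0 - alphas
    T = torch.cumprod(one_minus, dim=0)
    T = torch.cat([torch.ones_like(T[:1]), T[:-1]], dim=0)

    weights = T * alphas            # (N, P): contribution of each Gaussian to each pixel
    return weights.T @ colors       # (P, 3)
```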
1.2 Training 3D Gaussian Representations
For Q1.2 I made Gaussian parameters trainable, set up an optimizer with different learning rates per parameter
group, and implemented the training loop and loss in train.py.
Training setup and held-out metrics
| Optimizer | Adam |
|---|---|
| Learning rate (means) | 1e-4 |
| Learning rate (opacities) | 1e-3 |
| Learning rate (colors) | 1e-3 |
| Learning rate (scales) | 1e-3 |
| Number of iterations | 1000 |
| PSNR on held-out views | 26.4 dB |
| SSIM on held-out views | 0.89 |
Learning rates were tuned to ensure stable updates across parameters with different sensitivities:
the Gaussian means used a smaller step size (1e-4) to avoid exploding geometry,
while opacity, color, and scale parameters used moderately higher rates (1e-3)
for faster convergence of appearance. The model was trained for 1000 iterations, producing smooth,
geometry-consistent reconstructions and realistic color rendering across multiple views.
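A minimal sketch of the optimizer setup and training loop follows; the attribute names and the L1 photometric loss are assumptions made only to make the per-group learning rates from the table concrete, not a copy of train.py.

```python
import torch
import torch.nn.functional as F

def make_optimizer(gaussians):
    # One Adam parameter group per Gaussian attribute, using the learning rates
    # from the table above. Attribute names are illustrative.
    return torch.optim.Adam([
        {"params": [gaussians.means],     "lr": 1e-4},
        {"params": [gaussians.opacities], "lr": 1e-3},
        {"params": [gaussians.colors],    "lr": 1e-3},
        {"params": [gaussians.scales],    "lr": 1e-3},
    ])

def train(gaussians, renderer, views, n_iters=1000):
    optimizer = make_optimizer(gaussians)
    for it in range(n_iters):
        camera, gt_image = views[it % len(views)]   # cycle over training views
        pred = renderer(gaussians, camera)          # forward splatting pass
        loss = F.l1_loss(pred, gt_image)            # photometric loss (L1 assumed)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```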
Training progress and final renders
Training progress (q1_training_progress.gif): the top row shows the current Gaussian renderings and the bottom row shows the corresponding ground-truth images.
Final renders (q1_training_final_renders.gif).
1.3 Spherical Harmonics and Harder Scene Experiments
In this extension, I incorporated spherical harmonics (SH) to model view-dependent color variation and evaluated the Gaussian Splatting pipeline on a more complex scene. The goal was to capture lighting-dependent color changes and test reconstruction stability under more challenging geometry and illumination.
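As a reference for the SH extension, the sketch below evaluates degree-1 spherical harmonics into a view-dependent RGB color. The function name and tensor layout are mine; the basis constants, sign pattern, and 0.5 offset follow the common 3DGS convention.

```python
import torch

SH_C0 = 0.28209479177387814   # degree-0 (constant) basis
SH_C1 = 0.4886025119029199    # degree-1 basis

def sh_to_rgb(sh_coeffs, means, camera_center):
    """
    Evaluate degree-1 spherical harmonics into a view-dependent RGB color.

    sh_coeffs:     (N, 4, 3) per-Gaussian SH coefficients (DC + three degree-1 terms)
    means:         (N, 3)    Gaussian centers in world space
    camera_center: (3,)      camera position in world space
    """
    # View direction from the camera toward each Gaussian.
    dirs = means - camera_center
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    x, y, z = dirs[:, 0:1], dirs[:, 1:2], dirs[:, 2:3]

    rgb = (SH_C0 * sh_coeffs[:, 0]
           - SH_C1 * y * sh_coeffs[:, 1]
           + SH_C1 * z * sh_coeffs[:, 2]
           - SH_C1 * x * sh_coeffs[:, 3])
    # The usual 3DGS convention adds 0.5 and clamps to [0, 1].
    return (rgb + 0.5).clamp(0.0, 1.0)
```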
Rendered Results
The GIFs below show results from 1.1.5 (without SH) and 1.3.1 (with SH),
rendered using render.py. Each sequence visualizes the reconstructed truck scene from multiple
viewpoints.
Side-by-Side RGB Comparisons
The following image pairs show corresponding camera views from the two renderings. Each comparison highlights how spherical harmonics capture subtle lighting and shading variations across views.
Analysis and Observations
Although spherical harmonics (SH) were implemented to introduce view-dependent color, there was little to no visible difference between the SH and non-SH renderings. This likely occurred because the SH components were not meaningfully contributing to the final color output: either the view direction was not properly passed into the SH evaluation step, or the higher-order SH coefficients were never updated during training. As a result, the rendering effectively relied only on the zeroth-order (constant) color term, producing view-independent appearance identical to the baseline. Additionally, the truck dataset contains mostly diffuse surfaces with minimal specular highlights, which further limits the benefit of SH modeling. Together, these factors explain why both results appear visually indistinguishable despite the SH extension being enabled.
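Two quick sanity checks would isolate the suspected failure modes; the attribute and parameter names below are illustrative, not the actual ones in my code.

```python
def check_sh_training(gaussians, optimizer):
    """Sanity checks for the two failure modes above (names are illustrative)."""
    # 1) The SH coefficients must be trainable and registered with the optimizer,
    #    otherwise the higher-order terms never move from their initial values.
    assert gaussians.sh_coeffs.requires_grad
    assert any(gaussians.sh_coeffs is p
               for group in optimizer.param_groups
               for p in group["params"])
    # 2) After a backward pass, the degree>0 coefficients should carry non-zero
    #    gradients only if the view direction actually reaches the SH evaluation.
    return gaussians.sh_coeffs.grad.abs()[:, 1:].max()
```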
Part 2: Diffusion-guided Optimization
2.1 SDS Loss and Image Optimization
I implemented the SDS loss in SDS.py and used it to optimize images from text prompts in
Q21_image_optimization.py. Outputs are organized under q2_output/image/ with one
folder per prompt, each containing a single final image named output.png.
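For context, here is a condensed sketch of the SDS loss as described in DreamFusion. The `unet` is treated as a plain callable returning the predicted noise, and `scheduler`, `text_emb`, and the timestep range stand in for the corresponding Stable Diffusion components wired up in SDS.py; the exact call signatures differ in the real code.

```python
import torch
import torch.nn.functional as F

def sds_loss(unet, scheduler, latents, text_emb, guidance_scale=100.0):
    """Score Distillation Sampling loss (DreamFusion-style), sketched for a latent diffusion backbone."""
    # Sample a timestep and noise, then form the noised latents x_t.
    t = torch.randint(20, 980, (latents.shape[0],), device=latents.device)
    noise = torch.randn_like(latents)
    alpha_bar = scheduler.alphas_cumprod.to(latents.device)[t].view(-1, 1, 1, 1)
    noisy = alpha_bar.sqrt() * latents + (1 - alpha_bar).sqrt() * noise

    # Predict noise with classifier-free guidance; text_emb stacks the
    # unconditional and conditional embeddings along the batch dimension.
    with torch.no_grad():
        eps_uncond, eps_cond = unet(torch.cat([noisy] * 2),
                                    torch.cat([t] * 2), text_emb).chunk(2)
    eps_pred = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

    # SDS gradient w(t) * (eps_pred - eps), injected through a detached target
    # so that loss.backward() delivers exactly this gradient to `latents`.
    w = 1.0 - alpha_bar
    grad = w * (eps_pred - noise)
    target = (latents - grad).detach()
    return 0.5 * F.mse_loss(latents, target, reduction="sum")
```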
Hamburger prompt: effect of SDS guidance
For the prompt "a hamburger", I optimized two images to compare the effect of SDS guidance.
With SDS guidance: q2_output/image/a_hamburger/output.png
Without SDS guidance: q2_output/image/a_hamburgerno_sds_guide/output.png
Other prompts
Below are the final optimized images for the remaining prompts, each using SDS guidance.
2.3 NeRF Optimization With SDS
For this part, I optimized a Neural Radiance Field (NeRF) under text-based diffusion guidance (SDS loss). Each NeRF was trained for 100 epochs, and I rendered both RGB and depth videos for three prompts. The rendered results should reflect reasonable geometry and color corresponding to each text description.
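The per-step structure follows the same pattern as Q2.1, with the rendered NeRF image standing in for the optimized pixels. A rough sketch is below; every helper name is a placeholder, not an actual function in Q23_nerf_optimization.py.

```python
def nerf_sds_step(nerf, renderer, sample_camera, encode_to_latents,
                  sds_loss_fn, text_emb, optimizer):
    """One SDS optimization step on a NeRF (sketch; all helpers are placeholders)."""
    camera = sample_camera()                 # random viewpoint each iteration
    rgb = renderer(nerf, camera)             # (1, 3, H, W) rendered image
    latents = encode_to_latents(rgb)         # VAE-encode for the latent diffusion model
    loss = sds_loss_fn(latents, text_emb)    # SDS loss from Q2.1
    optimizer.zero_grad()
    loss.backward()                          # gradients flow back into the NeRF weights
    optimizer.step()
    return loss.item()
```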
Prompt: "a standing corgi dog"
Prompt: "a banana"
Prompt: "a santa corgi dog"
2.4.1 View-dependent text embedding
In Q2.3, the SDS optimization treated each rendered view independently, often leading to 3D-inconsistent
shapes (for instance, multiple front faces appearing in different views). Following the DreamFusion
approach (Sec. 3.2, “Diffusion loss with view-dependent conditioning”), I enabled
view_dependent=True in prepare_embeddings() and integrated the resulting
embeddings into Q23_nerf_optimization.py.
This allows the text embedding to depend on the current camera view, improving multi-view consistency.
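Conceptually, the view-dependent conditioning just swaps in a direction-specific text embedding based on the camera azimuth. A small sketch of that selection step is shown here; the dictionary keys and angle thresholds are illustrative, and the embeddings themselves come from prepare_embeddings(view_dependent=True).

```python
def select_view_embedding(embeddings, azimuth_deg):
    """
    Pick a direction-conditioned text embedding from the camera azimuth,
    following DreamFusion's view-dependent prompting ("..., front view" /
    "side view" / "back view"). Keys and angle thresholds are illustrative.
    """
    a = azimuth_deg % 360.0
    if a < 45.0 or a >= 315.0:
        key = "front"
    elif 135.0 <= a < 225.0:
        key = "back"
    else:
        key = "side"
    return embeddings[key]   # precomputed by prepare_embeddings(view_dependent=True)
```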
Below, I compare the view-independent (baseline Q2.3) and view-dependent NeRF renderings for "a standing corgi dog" and two additional prompts.
Prompt: "a standing corgi dog"
Baseline (Q2.3 — view-independent)
With view-dependent text embedding
The view-dependent embedding yields noticeably more stable geometry: the corgi’s body and head remain consistent as the camera rotates, and duplicated front-facing features disappear.
Prompt: "a banana"
Baseline (Q2.3 — view-independent)
With view-dependent text embedding
Prompt: "a santa corgi dog"
Baseline (Q2.3 — view-independent)
With view-dependent text embedding
Across all three prompts, the view-dependent conditioning preserves the object's 3D shape and prevents texture flickering as the camera moves. Compared to the baseline, the geometry appears smoother and more consistent across views.