Assignment 4 - 3D Gaussian Splatting and Diffusion-guided Optimization

Course: 16-825 (Learning for 3D Vision)
Assignment: 4
Student: Kunwoo Lee

This page summarizes my implementation and results for Gaussian splatting (Q1) and diffusion-guided optimization (Q2).


Part 1: 3D Gaussian Splatting


1.1 Perform Splatting

I implemented projection, filtering, alpha and transmittance computation, and splatting in model.py, and used them in render.py to render the pre-trained Gaussians.
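
As a rough illustration of the compositing step (not the exact code in model.py; the function name and tensor layout below are my own), the per-pixel alpha and transmittance computation boils down to:

```python
import torch

def composite_splats(px, means_2d, inv_covs_2d, opacities, colors):
    """
    Alpha-composite depth-sorted 2D Gaussians at a set of pixel locations.
    px:          (P, 2)   pixel coordinates
    means_2d:    (N, 2)   projected Gaussian centres, sorted front-to-back
    inv_covs_2d: (N, 2, 2) inverse 2D covariances after projection
    opacities:   (N, 1)
    colors:      (N, 3)
    Returns (P, 3) composited colors. Names and shapes are illustrative.
    """
    # Squared Mahalanobis distance of every pixel to every Gaussian: (P, N)
    d = px[:, None, :] - means_2d[None, :, :]                      # (P, N, 2)
    maha = torch.einsum("pni,nij,pnj->pn", d, inv_covs_2d, d)

    # Per-Gaussian alpha at each pixel, clamped for numerical stability
    alpha = opacities.squeeze(-1)[None, :] * torch.exp(-0.5 * maha)
    alpha = alpha.clamp(max=0.99)                                  # (P, N)

    # Transmittance: product of (1 - alpha) over all closer Gaussians
    T = torch.cumprod(1.0 - alpha + 1e-10, dim=1)
    T = torch.cat([torch.ones_like(T[:, :1]), T[:, :-1]], dim=1)   # shift right

    # Weighted sum of colors
    weights = (T * alpha)[..., None]                               # (P, N, 1)
    return (weights * colors[None, :, :]).sum(dim=1)               # (P, 3)
```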

Gaussian splatting render with modified settings
Rendered views with modified parameters (for example, a different number of Gaussians per splat).

Files: q1_output/q1_render.gif, q1_output/q1_render_115.gif

1.2 Training 3D Gaussian Representations

For Q1.2 I made Gaussian parameters trainable, set up an optimizer with different learning rates per parameter group, and implemented the training loop and loss in train.py.

Training setup

Optimizer: Adam
Learning rate (means): 1e-4
Learning rate (opacities): 1e-3
Learning rate (colors): 1e-3
Learning rate (scales): 1e-3
Number of iterations: 1000
PSNR on held-out views: 26.4 dB
SSIM on held-out views: 0.89

Learning rates were tuned to ensure stable updates across parameters with different sensitivities: the Gaussian means used a smaller step size (1e-4) to avoid exploding geometry, while opacity, color, and scale parameters used moderately higher rates (1e-3) for faster convergence of appearance. The model was trained for 1000 iterations, producing smooth, geometry-consistent reconstructions and realistic color rendering across multiple views.
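
A minimal sketch of how these per-parameter learning rates can be set up with a single Adam optimizer (the parameter names and shapes below are illustrative placeholders, not the exact fields in train.py):

```python
import torch

# Dummy trainable Gaussian parameters; names and shapes are illustrative only.
N = 10_000
means     = torch.nn.Parameter(torch.randn(N, 3))
opacities = torch.nn.Parameter(torch.zeros(N, 1))
colors    = torch.nn.Parameter(torch.rand(N, 3))
scales    = torch.nn.Parameter(torch.full((N, 3), -3.0))  # e.g. log-scales

# One Adam optimizer, one learning rate per parameter group.
optimizer = torch.optim.Adam([
    {"params": [means],     "lr": 1e-4},  # geometry: smaller steps
    {"params": [opacities], "lr": 1e-3},
    {"params": [colors],    "lr": 1e-3},
    {"params": [scales],    "lr": 1e-3},
])
```

Each training iteration then renders the current Gaussians for a training camera, computes an image reconstruction loss against the ground-truth view, and calls `loss.backward()` followed by `optimizer.step()`.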

Training progress and final renders

Training progress gif
Training progress (q1_training_progress.gif): top row shows current Gaussian renderings, bottom row shows ground truth images.
Final training renders gif
Final renders after training (q1_training_final_renders.gif).

1.3 Spherical Harmonics and Harder Scene Experiments

In this extension, I incorporated spherical harmonics (SH) to model view-dependent color variation and evaluated the Gaussian Splatting pipeline on a more complex scene. The goal was to capture lighting-dependent color changes and test reconstruction stability under more challenging geometry and illumination.
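
For reference, a minimal degree-1 SH evaluation looks roughly like the sketch below; the actual pipeline may use higher SH degrees and a different tensor layout, so this is only an illustration of the idea.

```python
import torch

# Real spherical harmonics constants for degrees 0 and 1
SH_C0 = 0.28209479177387814
SH_C1 = 0.4886025119029199

def sh_to_rgb(sh_coeffs, view_dirs):
    """
    Evaluate degree-1 spherical harmonics to get view-dependent RGB.
    sh_coeffs: (N, 4, 3) per-Gaussian SH coefficients (DC term + 3 linear terms)
    view_dirs: (N, 3)    unit vectors from the camera to each Gaussian
    Returns (N, 3) colors. A minimal sketch only.
    """
    x, y, z = view_dirs[:, 0:1], view_dirs[:, 1:2], view_dirs[:, 2:3]
    rgb = (SH_C0 * sh_coeffs[:, 0]
           - SH_C1 * y * sh_coeffs[:, 1]
           + SH_C1 * z * sh_coeffs[:, 2]
           - SH_C1 * x * sh_coeffs[:, 3])
    return torch.clamp(rgb + 0.5, 0.0, 1.0)
```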

Rendered Results

The GIFs below show results from 1.1.5 (without SH) and 1.3.1 (with SH), rendered using render.py. Each sequence visualizes the reconstructed truck scene from multiple viewpoints.

Gaussian splatting without spherical harmonics
Without SH — view-independent color representation
Gaussian splatting with spherical harmonics
With SH — view-dependent color modeled via learned SH coefficients

Side-by-Side RGB Comparisons

The following image pairs show corresponding camera views from the two renderings. Each comparison highlights how spherical harmonics capture subtle lighting and shading variations across views.

View 1 without SH
View 1 — Without SH
View 1 with SH
View 1 — With SH
View 2 without SH
View 2 — Without SH
View 2 with SH
View 2 — With SH
View 3 without SH
View 3 — Without SH
View 3 with SH
View 3 — With SH

Analysis and Observations


Part 2: Diffusion-guided Optimization


2.1 SDS Loss and Image Optimization

I implemented the SDS loss in SDS.py and used it to optimize images from text prompts in Q21_image_optimization.py. Outputs are organized under q2_output/image/ with one folder per prompt, each containing a single final image named output.png.
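
The core of the loss follows the standard SDS formulation: add noise to the current latents, predict it with the frozen UNet under classifier-free guidance, and inject the gradient w(t)(eps_pred - eps) through a detached target. The sketch below assumes diffusers-style `unet` and `scheduler` objects and is not the exact code in SDS.py:

```python
import torch
import torch.nn.functional as F

def sds_loss(latents, text_embeddings, unet, scheduler, guidance_scale=100.0):
    """
    Score Distillation Sampling loss on image latents (hedged sketch).
    text_embeddings is assumed to be torch.cat([uncond_emb, cond_emb], dim=0).
    """
    # Sample a random timestep and add the corresponding noise
    t = torch.randint(50, 950, (latents.shape[0],), device=latents.device)
    noise = torch.randn_like(latents)
    noisy_latents = scheduler.add_noise(latents, noise, t)

    # Predict the noise with classifier-free guidance (no grad through the UNet)
    with torch.no_grad():
        latent_in = torch.cat([noisy_latents] * 2)
        noise_pred = unet(latent_in, torch.cat([t] * 2),
                          encoder_hidden_states=text_embeddings).sample
        noise_uncond, noise_text = noise_pred.chunk(2)
        noise_pred = noise_uncond + guidance_scale * (noise_text - noise_uncond)

    # SDS gradient w(t) * (eps_pred - eps), injected via a detached target
    w = 1.0 - scheduler.alphas_cumprod.to(latents.device)[t].view(-1, 1, 1, 1)
    grad = w * (noise_pred - noise)
    target = (latents - grad).detach()
    return 0.5 * F.mse_loss(latents, target, reduction="sum") / latents.shape[0]
```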

Hamburger prompt: effect of SDS guidance

For the prompt "a hamburger", I optimized two images to compare the effect of SDS guidance.

Hamburger with SDS guidance
With SDS guidance
q2_output/image/a_hamburger/output.png
Hamburger without SDS guidance
Without SDS guidance
q2_output/image/a_hamburgerno_sds_guide/output.png

Other prompts

Below are the final optimized images for the remaining prompts, each using SDS guidance.

a standing corgi dog
"a standing corgi dog"
a hot coffee
"a hot coffee"
a walking person
"a walking person"
a cat skateboarding
"a cat skateboarding"
a cat snowboarding
"a cat snowboarding"

2.3 NeRF Optimization With SDS

For this part, I optimized a Neural Radiance Field (NeRF) under text-based diffusion guidance (SDS loss). Each NeRF was trained for 100 epochs, and I rendered both RGB and depth videos for three prompts. The rendered results should reflect reasonable geometry and color corresponding to each text description.
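
Conceptually, each optimization step renders the NeRF from a random camera, encodes the render into the diffusion latent space, and backpropagates the SDS loss. The sketch below reuses the `sds_loss` sketch from Q2.1; every argument is a placeholder for the corresponding object in Q23_nerf_optimization.py, and the VAE call follows a diffusers-style API.

```python
import torch

def optimize_nerf_with_sds(nerf, vae, unet, scheduler, text_embeddings,
                           optimizer, sample_random_camera, num_steps=10_000):
    """Illustrative NeRF + SDS loop; all arguments are placeholders."""
    for step in range(num_steps):
        camera = sample_random_camera()              # random pose around the object
        rgb, depth = nerf.render(camera)             # e.g. (1, H, W, 3), (1, H, W)

        # Encode the render into the Stable Diffusion latent space
        img = rgb.permute(0, 3, 1, 2) * 2.0 - 1.0    # (1, 3, H, W) in [-1, 1]
        latents = vae.encode(img).latent_dist.sample() * 0.18215

        loss = sds_loss(latents, text_embeddings, unet, scheduler)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```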

Prompt: "a standing corgi dog"

Rendered RGB view — “a standing corgi dog”
Rendered depth map — “a standing corgi dog”

Prompt: "a banana"

Rendered RGB view — “a banana”
Rendered depth map — “a banana”

Prompt: "a santa corgi dog"

Rendered RGB view — “a santa corgi dog”
Rendered depth map — “a santa corgi dog”

2.4.1 View-dependent text embedding

In Q2.3, the SDS optimization treated each rendered view independently, often leading to 3D-inconsistent shapes (for instance, multiple front faces appearing in different views). Following the DreamFusion approach (Sec. 3.2, “Diffusion loss with view-dependent conditioning”), I enabled view_dependent=True in prepare_embeddings() and integrated the resulting embeddings into Q23_nerf_optimization.py. This allows the text embedding to depend on the current camera view, improving multi-view consistency.
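
Concretely, the view-dependent path selects a direction-conditioned text embedding based on the camera azimuth (roughly front/side/back, as in DreamFusion). The helper below is a hypothetical sketch; the dictionary keys are assumptions about what prepare_embeddings() returns, not its actual interface.

```python
def view_dependent_embedding(embeddings, azimuth_deg):
    """
    Pick a direction-conditioned text embedding given the camera azimuth.
    `embeddings` is assumed to map "front" / "side" / "back" to embedding
    tensors produced with view_dependent=True (an illustrative layout).
    """
    a = azimuth_deg % 360.0
    if a < 45.0 or a >= 315.0:
        key = "front"
    elif 135.0 <= a < 225.0:
        key = "back"
    else:
        key = "side"
    return embeddings[key]
```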

Below, I compare the view-independent (baseline Q2.3) and view-dependent NeRF renderings for "a standing corgi dog" and one additional prompt.

Prompt: "a standing corgi dog"

Baseline (Q2.3 — view-independent)

RGB — baseline (view-independent)
Depth — baseline (view-independent)

With view-dependent text embedding

RGB — with view-dependent text conditioning
Depth — with view-dependent text conditioning

The view-dependent embedding yields noticeably more stable geometry: the corgi’s body and head remain consistent as the camera rotates, and duplicated front-facing features disappear.

Prompt: "a banana"

Baseline (Q2.3 — view-independent)

RGB — baseline (view-independent)
Depth — baseline (view-independent)

With view-dependent text embedding

RGB — with view-dependent text conditioning
Depth — with view-dependent text conditioning

Prompt: "a santa corgi dog"

Baseline (Q2.3 — view-independent)

RGB — baseline (view-independent)
Depth — baseline (view-independent)

With view-dependent text embedding

RGB — with view-dependent text conditioning
Depth — with view-dependent text conditioning

As with the corgi prompt, the view-dependent conditioning keeps the 3D shape consistent and prevents texture flickering as the camera moves. Compared to the baseline, the geometry appears smoother and more stable across views.