Assignment 4 - 3D Gaussian Splatting and Diffusion-guided Optimization

Course: 16-825 (Learning for 3D Vision)
Assignment: 4
Student: Kunwoo Lee

This page summarizes my implementation and results for Gaussian splatting (Q1) and diffusion-guided optimization (Q2).


Part 1: 3D Gaussian Splatting


1.1 Perform Splatting

I implemented projection, filtering, alpha and transmittance computation, and splatting in model.py, and used them in render.py to render the pre-trained Gaussians.
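
As a rough illustration of the compositing step (not the exact code in model.py; the function name and tensor layout below are my own), the per-pixel alpha and transmittance computation boils down to:

```python
import torch

def composite_splats(px, means_2d, inv_covs_2d, opacities, colors):
    """
    Alpha-composite depth-sorted 2D Gaussians at a set of pixel locations.
    px:          (P, 2)   pixel coordinates
    means_2d:    (N, 2)   projected Gaussian centres, sorted front-to-back
    inv_covs_2d: (N, 2, 2) inverse 2D covariances after projection
    opacities:   (N, 1)
    colors:      (N, 3)
    Returns (P, 3) composited colors. Names and shapes are illustrative.
    """
    # Squared Mahalanobis distance of every pixel to every Gaussian: (P, N)
    d = px[:, None, :] - means_2d[None, :, :]                      # (P, N, 2)
    maha = torch.einsum("pni,nij,pnj->pn", d, inv_covs_2d, d)

    # Per-Gaussian alpha at each pixel, clamped for numerical stability
    alpha = opacities.squeeze(-1)[None, :] * torch.exp(-0.5 * maha)
    alpha = alpha.clamp(max=0.99)                                  # (P, N)

    # Transmittance: product of (1 - alpha) over all closer Gaussians
    T = torch.cumprod(1.0 - alpha + 1e-10, dim=1)
    T = torch.cat([torch.ones_like(T[:, :1]), T[:, :-1]], dim=1)   # shift right

    # Weighted sum of colors
    weights = (T * alpha)[..., None]                               # (P, N, 1)
    return (weights * colors[None, :, :]).sum(dim=1)               # (P, 3)
```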

Gaussian splatting render with modified settings
Rendered views with modified parameters (for example, a different number of Gaussians per splat).

Files: q1_output/q1_render.gif, q1_output/q1_render_115.gif

1.2 Training 3D Gaussian Representations

For Q1.2 I made Gaussian parameters trainable, set up an optimizer with different learning rates per parameter group, and implemented the training loop and loss in train.py.

Training setup

Optimizer: Adam
Learning rate (means): 1e-4
Learning rate (opacities): 1e-3
Learning rate (colors): 1e-3
Learning rate (scales): 1e-3
Number of iterations: 1000
PSNR on held-out views: 26.4 dB
SSIM on held-out views: 0.89

Learning rates were tuned to ensure stable updates across parameters with different sensitivities: the Gaussian means used a smaller step size (1e-4) to avoid exploding geometry, while opacity, color, and scale parameters used moderately higher rates (1e-3) for faster convergence of appearance. The model was trained for 1000 iterations, producing smooth, geometry-consistent reconstructions and realistic color rendering across multiple views.
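
A minimal sketch of how these per-parameter learning rates can be set up with a single Adam optimizer (the parameter names and shapes below are illustrative placeholders, not the exact fields in train.py):

```python
import torch

# Dummy trainable Gaussian parameters; names and shapes are illustrative only.
N = 10_000
means     = torch.nn.Parameter(torch.randn(N, 3))
opacities = torch.nn.Parameter(torch.zeros(N, 1))
colors    = torch.nn.Parameter(torch.rand(N, 3))
scales    = torch.nn.Parameter(torch.full((N, 3), -3.0))  # e.g. log-scales

# One Adam optimizer, one learning rate per parameter group.
optimizer = torch.optim.Adam([
    {"params": [means],     "lr": 1e-4},  # geometry: smaller steps
    {"params": [opacities], "lr": 1e-3},
    {"params": [colors],    "lr": 1e-3},
    {"params": [scales],    "lr": 1e-3},
])
```

Each training iteration then renders the current Gaussians for a training camera, computes an image reconstruction loss against the ground-truth view, and calls `loss.backward()` followed by `optimizer.step()`.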

Training progress and final renders

Training progress gif
Training progress (q1_training_progress.gif): top row shows current Gaussian renderings, bottom row shows ground truth images.
Final training renders gif
Final renders after training (q1_training_final_renders.gif).

1.3 Spherical Harmonics and Harder Scene Experiments

In this extension, I incorporated spherical harmonics (SH) to model view-dependent color variation and evaluated the Gaussian Splatting pipeline on a more complex scene. The goal was to capture lighting-dependent color changes and test reconstruction stability under more challenging geometry and illumination.
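
For reference, a minimal degree-1 SH evaluation looks roughly like the sketch below; the actual pipeline may use higher SH degrees and a different tensor layout, so this is only an illustration of the idea.

```python
import torch

# Real spherical harmonics constants for degrees 0 and 1
SH_C0 = 0.28209479177387814
SH_C1 = 0.4886025119029199

def sh_to_rgb(sh_coeffs, view_dirs):
    """
    Evaluate degree-1 spherical harmonics to get view-dependent RGB.
    sh_coeffs: (N, 4, 3) per-Gaussian SH coefficients (DC term + 3 linear terms)
    view_dirs: (N, 3)    unit vectors from the camera to each Gaussian
    Returns (N, 3) colors. A minimal sketch only.
    """
    x, y, z = view_dirs[:, 0:1], view_dirs[:, 1:2], view_dirs[:, 2:3]
    rgb = (SH_C0 * sh_coeffs[:, 0]
           - SH_C1 * y * sh_coeffs[:, 1]
           + SH_C1 * z * sh_coeffs[:, 2]
           - SH_C1 * x * sh_coeffs[:, 3])
    return torch.clamp(rgb + 0.5, 0.0, 1.0)
```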

Rendered Results

The GIFs below show results from 1.1.5 (without SH) and 1.3.1 (with SH), rendered using render.py. Each sequence visualizes the reconstructed truck scene from multiple viewpoints.

Gaussian splatting without spherical harmonics
Without SH — view-independent color representation
Gaussian splatting with spherical harmonics
With SH — view-dependent color modeled via learned SH coefficients

Side-by-Side RGB Comparisons

The following image pairs show corresponding camera views from the two renderings. Each comparison highlights how spherical harmonics capture subtle lighting and shading variations across views.

View 1 without SH
View 1 — Without SH
View 1 with SH
View 1 — With SH
View 2 without SH
View 2 — Without SH
View 2 with SH
View 2 — With SH
View 3 without SH
View 3 — Without SH
View 3 with SH
View 3 — With SH

Analysis and Observations


Part 2: Diffusion-guided Optimization


2.1 SDS Loss and Image Optimization

I implemented the SDS loss in SDS.py and used it to optimize images from text prompts in Q21_image_optimization.py. Outputs are organized under q2_output/image/ with one folder per prompt, each containing a single final image named output.png.
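
The core of the loss follows the standard SDS formulation: add noise to the current latents, predict it with the frozen UNet under classifier-free guidance, and inject the gradient w(t)(eps_pred - eps) through a detached target. The sketch below assumes diffusers-style `unet` and `scheduler` objects and is not the exact code in SDS.py:

```python
import torch
import torch.nn.functional as F

def sds_loss(latents, text_embeddings, unet, scheduler, guidance_scale=100.0):
    """
    Score Distillation Sampling loss on image latents (hedged sketch).
    text_embeddings is assumed to be torch.cat([uncond_emb, cond_emb], dim=0).
    """
    # Sample a random timestep and add the corresponding noise
    t = torch.randint(50, 950, (latents.shape[0],), device=latents.device)
    noise = torch.randn_like(latents)
    noisy_latents = scheduler.add_noise(latents, noise, t)

    # Predict the noise with classifier-free guidance (no grad through the UNet)
    with torch.no_grad():
        latent_in = torch.cat([noisy_latents] * 2)
        noise_pred = unet(latent_in, torch.cat([t] * 2),
                          encoder_hidden_states=text_embeddings).sample
        noise_uncond, noise_text = noise_pred.chunk(2)
        noise_pred = noise_uncond + guidance_scale * (noise_text - noise_uncond)

    # SDS gradient w(t) * (eps_pred - eps), injected via a detached target
    w = 1.0 - scheduler.alphas_cumprod.to(latents.device)[t].view(-1, 1, 1, 1)
    grad = w * (noise_pred - noise)
    target = (latents - grad).detach()
    return 0.5 * F.mse_loss(latents, target, reduction="sum") / latents.shape[0]
```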

Hamburger prompt: effect of SDS guidance

For the prompt "a hamburger", I optimized two images to compare the effect of SDS guidance.

Hamburger with SDS guidance
With SDS guidance
q2_output/image/a_hamburger/output.png
Hamburger without SDS guidance
Without SDS guidance
q2_output/image/a_hamburgerno_sds_guide/output.png

Other prompts

Below are the final optimized images for the remaining prompts, each using SDS guidance.

a standing corgi dog
"a standing corgi dog"
a hot coffee
"a hot coffee"
a walking person
"a walking person"
a cat skateboarding
"a cat skateboarding"
a cat snowboarding
"a cat snowboarding"

2.3 NeRF Optimization With SDS

For this part, I optimized a Neural Radiance Field (NeRF) under text-based diffusion guidance (SDS loss). Each NeRF was trained for 100 epochs, and I rendered both RGB and depth videos for three prompts. The rendered results should reflect reasonable geometry and color corresponding to each text description.
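
Conceptually, each optimization step renders the NeRF from a random camera, encodes the render into the diffusion latent space, and backpropagates the SDS loss. The sketch below reuses the `sds_loss` sketch from Q2.1; every argument is a placeholder for the corresponding object in Q23_nerf_optimization.py, and the VAE call follows a diffusers-style API.

```python
import torch

def optimize_nerf_with_sds(nerf, vae, unet, scheduler, text_embeddings,
                           optimizer, sample_random_camera, num_steps=10_000):
    """Illustrative NeRF + SDS loop; all arguments are placeholders."""
    for step in range(num_steps):
        camera = sample_random_camera()              # random pose around the object
        rgb, depth = nerf.render(camera)             # e.g. (1, H, W, 3), (1, H, W)

        # Encode the render into the Stable Diffusion latent space
        img = rgb.permute(0, 3, 1, 2) * 2.0 - 1.0    # (1, 3, H, W) in [-1, 1]
        latents = vae.encode(img).latent_dist.sample() * 0.18215

        loss = sds_loss(latents, text_embeddings, unet, scheduler)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```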

Prompt: "a standing corgi dog"

Rendered RGB view — “a standing corgi dog”
Rendered depth map — “a standing corgi dog”

Prompt: "a banana"

Rendered RGB view — “a banana”
Rendered depth map — “a banana”

Prompt: "a santa corgi dog"

Rendered RGB view — “a santa corgi dog”
Rendered depth map — “a santa corgi dog”

2.4.1 View-dependent text embedding

In Q2.3, the SDS optimization treated each rendered view independently, often leading to 3D-inconsistent shapes (for instance, multiple front faces appearing in different views). Following the DreamFusion approach (Sec. 3.2, “Diffusion loss with view-dependent conditioning”), I enabled view_dependent=True in prepare_embeddings() and integrated the resulting embeddings into Q23_nerf_optimization.py. This allows the text embedding to depend on the current camera view, improving multi-view consistency.
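
Concretely, the view-dependent path selects a direction-conditioned text embedding based on the camera azimuth (roughly front/side/back, as in DreamFusion). The helper below is a hypothetical sketch; the dictionary keys are assumptions about what prepare_embeddings() returns, not its actual interface.

```python
def view_dependent_embedding(embeddings, azimuth_deg):
    """
    Pick a direction-conditioned text embedding given the camera azimuth.
    `embeddings` is assumed to map "front" / "side" / "back" to embedding
    tensors produced with view_dependent=True (an illustrative layout).
    """
    a = azimuth_deg % 360.0
    if a < 45.0 or a >= 315.0:
        key = "front"
    elif 135.0 <= a < 225.0:
        key = "back"
    else:
        key = "side"
    return embeddings[key]
```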

Below, I compare the view-independent (baseline Q2.3) and view-dependent NeRF renderings for "a standing corgi dog" and one additional prompt.

Prompt: "a standing corgi dog"

Baseline (Q2.3 — view-independent)

RGB — baseline (view-independent)
Depth — baseline (view-independent)

With view-dependent text embedding

RGB — with view-dependent text conditioning
Depth — with view-dependent text conditioning

The view-dependent embedding yields noticeably more stable geometry: the corgi’s body and head remain consistent as the camera rotates, and duplicated front-facing features disappear.

Prompt: "a banana"

Baseline (Q2.3 — view-independent)

RGB — baseline (view-independent)
Depth — baseline (view-independent)

With view-dependent text embedding

RGB — with view-dependent text conditioning
Depth — with view-dependent text conditioning

Prompt: "a santa corgi dog"

Baseline (Q2.3 — view-independent)

RGB — baseline (view-independent)
Depth — baseline (view-independent)

With view-dependent text embedding

RGB — with view-dependent text conditioning
Depth — with view-dependent text conditioning

As with the corgi prompt, the view-dependent conditioning keeps the 3D shape consistent and prevents texture flickering as the camera moves. Compared to the baseline, the geometry appears smoother and more stable across views.