HW4¶

Q1 3D Gaussian Splatting¶

1.1 3D Gaussian Rasterization (35 points)¶

1.1.1 - 1.1.4¶

See code.

1.1.5 Perform Splatting¶

1.2 Training 3D Gaussian Representations (15 points)¶

1.2.2 Perform Forward Pass and Compute Loss¶

Fig 1. Training Progress¶

Fig 2. Final Rendering¶

Fig 3. Eval (Iterations, PSNR, SSIM)¶

Learning Rates¶
| Parameter | Learning rate |
|-----------|---------------|
| opacities | 0.025 |
| scales | 0.005 |
| colours | 0.0025 |
| means | 0.00016 |
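These rates are applied as separate optimizer parameter groups, one per Gaussian attribute. A minimal sketch of how such a setup might look (the `gaussians` object and its attribute names are illustrative, not the starter code):

```python
import torch

# Hypothetical container of learnable Gaussian attributes; each tensor has
# requires_grad=True. One Adam parameter group per attribute lets each
# attribute use the learning rate from the table above.
optimizer = torch.optim.Adam([
    {"params": [gaussians.opacities], "lr": 0.025},
    {"params": [gaussians.scales],    "lr": 0.005},
    {"params": [gaussians.colours],   "lr": 0.0025},
    {"params": [gaussians.means],     "lr": 0.00016},
])
```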

1.3 Extensions¶

1.3.1 Rendering Using Spherical Harmonics (10 points)¶

| Frame # | With SH | Without SH | Observations |
|---------|---------|------------|--------------|
| Frame 0 | | | Notice that the cushion looks a lot brighter without spherical harmonics than with SH. The cushion has a view-dependent reflectance effect that comes from the microstructure of the material (see the SH evaluation sketch below the table). |
| Frame 12 | | | Similar effects to those in frame 0 can be observed here. |
| Frame 24 | | | The back of the seat is mostly diffuse and is unaffected by the presence of spherical harmonics. |
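The view-dependent colour noted above comes from evaluating each Gaussian's SH coefficients along the viewing direction of the current frame. A minimal sketch of a degree-0/1 SH evaluation (function name and tensor shapes are my own illustration, not the assignment starter code; the constants are the standard real SH basis factors used in common 3DGS implementations):

```python
import torch

SH_C0 = 0.28209479177387814  # degree-0 real SH constant
SH_C1 = 0.4886025119029199   # degree-1 real SH constant

def sh_to_rgb(sh_coeffs: torch.Tensor, view_dirs: torch.Tensor) -> torch.Tensor:
    """Evaluate degree-0/1 spherical harmonics into per-Gaussian RGB.

    sh_coeffs: (N, 4, 3) - DC term plus three degree-1 coefficients per channel.
    view_dirs: (N, 3)    - unit vectors from the camera centre to each Gaussian.
    """
    x, y, z = view_dirs[:, 0:1], view_dirs[:, 1:2], view_dirs[:, 2:3]
    rgb = SH_C0 * sh_coeffs[:, 0]            # view-independent (diffuse) part
    rgb = rgb - SH_C1 * y * sh_coeffs[:, 1]  # degree-1 terms add the
    rgb = rgb + SH_C1 * z * sh_coeffs[:, 2]  # view-dependent variation seen
    rgb = rgb - SH_C1 * x * sh_coeffs[:, 3]  # on the cushion above
    return torch.clamp(rgb + 0.5, 0.0, 1.0)  # offset and clamp, as in common 3DGS code
```

Without SH (degree 0 only), the colour is constant across views, which is why the mostly diffuse seat back looks the same in both renderings.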

1.3.2 Training On a Harder Scene (10 points)¶

I initially experimented with MoGe to incorporate monocular depth estimation for better Gaussian initialization, as I was confident that improved initialization would yield superior results compared to random initialization. However, I found it difficult to achieve consistent depth alignment with the given camera extrinsics. To address this, I switched to VGGT, using a small subset of the training images to estimate camera extrinsics and generate consistent point clouds. I then applied Umeyama alignment to compute a Sim(3) transformation that aligned the predicted extrinsics with the provided ones, thereby aligning the reconstructed points as well. However, I noticed that the generated points were not entirely reliable [see image below], so I needed to devise a method to filter out the noisy or incorrect ones.
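The Sim(3) step is a standard Umeyama fit between corresponding camera centres (VGGT's predicted centres versus those from the provided extrinsics), with the resulting similarity applied to the VGGT point cloud. A minimal NumPy sketch of that alignment, not the exact code I used:

```python
import numpy as np

def umeyama_sim3(src: np.ndarray, dst: np.ndarray):
    """Fit s, R, t such that dst ≈ s * (R @ src) + t (Umeyama, 1991).

    src, dst: (N, 3) corresponding points, e.g. predicted vs. provided camera centres.
    """
    mu_src, mu_dst = src.mean(0), dst.mean(0)
    x, y = src - mu_src, dst - mu_dst
    cov = y.T @ x / len(src)                      # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:  # guard against reflections
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_src = (x ** 2).sum() / len(src)           # mean squared deviation of src
    s = (D * np.diag(S)).sum() / var_src          # isotropic scale
    t = mu_dst - s * R @ mu_src
    return s, R, t

# The same similarity is then applied to the VGGT points:
# aligned_points = (s * (R @ points.T)).T + t
```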

Notice how the better VGGT initialization converges to a good result faster, both qualitatively and quantitatively.

Isotropic Baseline¶

python train_harder_scene.py --out_path ./output/q1.3.2/baseline \
    --gaussians_per_splat -1 \
    --isotropic_gaussians \
    --lr_opacity 0.02 \
    --lr_scale 0.02 \
    --lr_colour 0.02 \
    --lr_mean 0.02 \
    --init_type random
Anisotropic Extension with VGGT init¶

python train_harder_scene.py --out_path ./output/q1.3.2/extension \
    --init_type vggt \
    --num_itrs  500 \
    --gaussians_per_splat -1 \
    --lr_opacity 0.1 \
    --lr_scale 0.1 \
    --lr_colour 0.001 \
    --lr_mean 0.0001 \
    --lr_quat 0.00005

Q2. Diffusion-guided Optimization¶

2.1 SDS Loss + Image Optimization (20 points)¶

| Prompt | Iteration | With Guidance | Without Guidance |
|--------|-----------|---------------|------------------|
| a hamburger | 2000 | | |
| a standing corgi dog | 2000 | | |
| a F-16_fighter_jet | 2000 | | |
| a chimpanzee holding a banana | 2000 | | |

2.2 Texture Map Optimization for Mesh (15 points)¶

| Prompt | Iteration | Initial Mesh | Final Mesh |
|--------|-----------|--------------|------------|
| a black and white cow | 2000 | | |
| a blue cow with red patches | 2000 | | |
| a brown cow with orange patches | 2000 | | |

2.3 NeRF Optimization (15 points)¶

| Prompt | Iteration | Depth | RGB |
|--------|-----------|-------|-----|
| a hamburger | 10000 | | |
| a standing corgi dog | 10000 | | |
| a green frog on top of a horizontal flat rock | 10000 | | |

2.4 Extensions¶

2.4.1 View-dependent text embedding (10 points)¶

| Prompt | Iteration | Depth | RGB |
|--------|-----------|-------|-----|
| a red sports car | 10000 | | |
| a standing corgi dog | 10000 | | |
| a green frog on top of a horizontal flat rock | 10000 | | |

2.4.2¶

2.4.3 Variation of implementation of SDS loss (10 points)¶

I differentiably decode both the current and target latents with the VAE and map the results to either [0, 1] for L2/L1/Huber or [-1, 1] for LPIPS, optionally subsampling via average pooling before the loss to cut compute, and then compute the selected loss between the decoded current and target pixels. The final version uses the Huber loss with pixel_downsample == 4, since computing the loss at full resolution was far too slow.

Gradient-wise, the update is still driven by $\nabla = w(\hat\epsilon - \epsilon)$; the difference is that both the current and target latents are decoded through the VAE decoder into RGB images, and the pixel-level loss (L2, L1, Huber, or LPIPS) is computed between those decoded images rather than between the latents.
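A minimal sketch of this pixel-space variant (function and variable names are my own; I assume a diffusers-style Stable Diffusion VAE, where `vae.decode(latents / scaling_factor).sample` returns images in [-1, 1] and 0.18215 is the usual scaling factor):

```python
import torch
import torch.nn.functional as F

def pixel_space_sds_loss(latents, noise_pred, noise, w, vae, downsample=4):
    """Pixel-space SDS variant: compare VAE decodings of current vs. target latents.

    latents:    current latents being optimised (requires_grad=True)
    noise_pred: UNet prediction eps_hat at the sampled timestep
    noise:      the noise eps actually added to the latents
    w:          per-timestep weighting w(t)
    """
    # Standard SDS direction; the target is detached so gradients only flow
    # through the decoding of the *current* latents.
    grad = w * (noise_pred - noise)
    target_latents = (latents - grad).detach()

    # Differentiably decode both latents to RGB in [-1, 1].
    cur_img = vae.decode(latents / 0.18215).sample
    tgt_img = vae.decode(target_latents / 0.18215).sample

    # Map to [0, 1] for L1/L2/Huber (keep [-1, 1] if using LPIPS instead).
    cur_img = (cur_img + 1) / 2
    tgt_img = (tgt_img + 1) / 2

    # Optional average-pool subsampling to cut the cost of the pixel loss.
    if downsample > 1:
        cur_img = F.avg_pool2d(cur_img, downsample)
        tgt_img = F.avg_pool2d(tgt_img, downsample)

    return F.huber_loss(cur_img, tgt_img)
```

With an L2 loss applied directly to the latents instead of the decoded images, this reduces (up to a constant scale) to the usual latent SDS update; routing the comparison through the decoder is what changes the gradient that reaches the scene representation.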

| Prompt | Iteration | Depth | RGB | Observation |
|--------|-----------|-------|-----|-------------|
| a green frog on top of a horizontal flat rock | 8000 | | | The latent-space version of the SDS loss produces much cleaner depth renderings, yielding more solid geometry and finer detail in the RGB outputs. |
| a green frog on top of a horizontal flat rock | 8000 | | | The pixel-space version of the SDS loss is significantly slower due to its higher dimensionality, and even more so when the LPIPS loss is used, since that requires an additional VGG network. To keep the runtime manageable, I only used a subsampled set of pixels for the loss computation, which leads to noisier depth maps and RGB renderings with more floating artifacts and a less solid appearance. |