HW4¶
Q1 3D Gaussian Splatting¶
1.1 3D Gaussian Rasterization (35 points)¶
1.1.1 - 1.1.4¶
See code.
1.1.5 Perform Splatting¶

1.2 Training 3D Gaussian Representations (15 points)¶
1.3 Extensions¶
1.3.1 Rendering Using Spherical Harmonics (10 points)¶
| With SH | Without SH |
|---|---|
| ![]() | ![]() |
| Frame # | With SH | Without SH | Observations |
|---|---|---|---|
| Frame 0 | ![]() | ![]() | Notice that the cushion looks a lot brighter without spherical harmonics than with SH. The cushion has a view-dependent reflectance effect that comes from the microstructure of the material. |
| Frame 12 | ![]() | ![]() | Similar effects to frame 0 can be observed here. |
| Frame 24 | ![]() | ![]() | The back of the seat is mostly diffuse and unaffected by the presence of spherical harmonics. |
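For context on the comparison above: the view-dependent colour comes from evaluating each Gaussian's SH coefficients along the camera-to-Gaussian viewing direction. Below is a minimal sketch of that evaluation, assuming degree-2 coefficients stored as an `(N, 9, 3)` tensor and following the standard real-SH constants used by the 3DGS reference code; the `eval_sh_color` name and tensor layout are illustrative, not the assignment's exact API.

```python
import torch

# Real SH basis constants (same values as in the 3DGS reference code).
C0 = 0.28209479177387814
C1 = 0.4886025119029199
C2 = [1.0925484305920792, -1.0925484305920792,
      0.31539156525252005, -1.0925484305920792, 0.5462742152960396]

def eval_sh_color(sh, dirs):
    """Evaluate view-dependent RGB from SH coefficients.

    sh:   (N, 9, 3) coefficients (degrees 0..2); sh[:, 0] is the DC term.
    dirs: (N, 3) unit vectors from the camera centre to each Gaussian mean.
    """
    x, y, z = dirs[:, 0:1], dirs[:, 1:2], dirs[:, 2:3]
    color = C0 * sh[:, 0]                                    # diffuse (DC) term
    color = color - C1 * y * sh[:, 1] + C1 * z * sh[:, 2] - C1 * x * sh[:, 3]
    xx, yy, zz = x * x, y * y, z * z
    xy, yz, xz = x * y, y * z, x * z
    color = (color
             + C2[0] * xy * sh[:, 4]
             + C2[1] * yz * sh[:, 5]
             + C2[2] * (2.0 * zz - xx - yy) * sh[:, 6]
             + C2[3] * xz * sh[:, 7]
             + C2[4] * (xx - yy) * sh[:, 8])
    # Shift by 0.5 and clamp, as in the 3DGS reference implementation.
    return torch.clamp(color + 0.5, min=0.0)
```

Keeping only the `C0 * sh[:, 0]` term gives a purely diffuse, view-independent colour, which is presumably what the "Without SH" column corresponds to; this is why view-dependent effects such as the cushion's sheen disappear there.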
1.3.2 Training On a Harder Scene (10 points)¶
I initially experimented with MoGe to incorporate monocular depth estimation for better Gaussian initialization, as I was confident that improved initialization would yield superior results compared to random initialization. However, I found it difficult to achieve consistent depth alignment with the given camera extrinsics. To address this, I switched to VGGT, using a small subset of the training images to estimate camera extrinsics and generate consistent point clouds. I then applied Umeyama alignment to compute a Sim(3) transformation that aligned the predicted extrinsics with the provided ones, thereby aligning the reconstructed points as well. However, I noticed that the generated points were not entirely reliable [see image below], so I needed to devise a method to filter out the noisy or incorrect ones.

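A minimal sketch of the Umeyama alignment step described above, assuming the VGGT-predicted and provided camera centres are available as matching `(N, 3)` arrays (NumPy and the `umeyama_sim3` name are illustrative; this is not the exact code used):

```python
import numpy as np

def umeyama_sim3(src, dst):
    """Estimate (s, R, t) such that dst_i ≈ s * R @ src_i + t.

    src, dst: (N, 3) corresponding points, e.g. VGGT-predicted camera centres
    and the camera centres recovered from the provided extrinsics.
    """
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    x, y = src - mu_src, dst - mu_dst
    cov = y.T @ x / src.shape[0]                     # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:     # guard against reflections
        S[2, 2] = -1.0
    R = U @ S @ Vt                                   # rotation
    var_src = (x ** 2).sum() / src.shape[0]          # source-point variance
    s = np.trace(np.diag(D) @ S) / var_src           # similarity scale
    t = mu_dst - s * R @ mu_src                      # translation
    return s, R, t

# The same Sim(3) then maps the VGGT point cloud into the provided world frame:
# aligned = (s * (R @ points.T)).T + t
```

The recovered (s, R, t) is applied to the VGGT point cloud so that the filtered points live in the same world frame as the provided extrinsics.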
Notice how the better VGGT initialization converges faster to a good result, both qualitatively and quantitatively.
Isotropic Baseline¶

```bash
python train_harder_scene.py --out_path ./output/q1.3.2/baseline \
    --gaussians_per_splat -1 \
    --isotropic_gaussians \
    --lr_opacity 0.02 \
    --lr_scale 0.02 \
    --lr_colour 0.02 \
    --lr_mean 0.02 \
    --init_type random
```
Anisotropic Extension with VGGT init¶

```bash
python train_harder_scene.py --out_path ./output/q1.3.2/extension \
    --init_type vggt \
    --num_itrs 500 \
    --gaussians_per_splat -1 \
    --lr_opacity 0.1 \
    --lr_scale 0.1 \
    --lr_colour 0.001 \
    --lr_mean 0.0001 \
    --lr_quat 0.00005
```
Q2. Diffusion-guided Optimization¶
2.1 SDS Loss + Image Optimization (20 points)¶
| Prompt | Iteration | With Guidance | Without Guidance |
|---|---|---|---|
| a hamburger | 2000 | ![]() | ![]() |
| a standing corgi dog | 2000 | ![]() | ![]() |
| a F-16_fighter_jet | 2000 | ![]() | ![]() |
| a chimpanzee holding a banana | 2000 | ![]() | ![]() |
2.2 Texture Map Optimization for Mesh (15 points)¶
| Prompt | Iteration | Initial Mesh | Final Mesh |
|---|---|---|---|
| a black and white cow | 2000 | ![]() | ![]() |
| a blue cow with red patches | 2000 | ![]() | ![]() |
| a brown cow with orange patches | 2000 | ![]() | ![]() |
2.3 NeRF Optimization (15 points)¶
| Prompt | Iteration | Depth | RGB |
|---|---|---|---|
| a hamburger | 10000 | ![]() | ![]() |
| a standing corgi dog | 10000 | ![]() | ![]() |
| a green frog on top of a horizontal flat rock | 10000 | ![]() | ![]() |
2.4 Extensions¶
2.4.1 View-dependent text embedding (10 points)¶
| Prompt | Iteration | Depth | RGB |
|---|---|---|---|
| a red sports car | 10000 | ![]() | ![]() |
| a standing corgi dog | 10000 | ![]() | ![]() |
| a green frog on top of a horizontal flat rock | 10000 | ![]() | ![]() |
2.4.2¶
2.4.3 Variation of implementation of SDS loss (10 points)¶
I differentiably decode both the current and target latents with the VAE, map them to either [0, 1] for L2/L1/Huber or [-1, 1] for LPIPS, optionally subsample via average pooling before the loss to cut compute, and then compute the selected loss between the decoded current and target pixels. Huber loss is used in the final version, with pixel_downsample == 4, because otherwise it would have been far too slow.
In terms of the gradient calculation, the update direction is still given by $\nabla = w(\hat\epsilon - \epsilon)$; the difference is that we decode both the current and target latents through the VAE decoder into RGB images and compute a pixel-level loss (L2, L1, Huber, or LPIPS) between them.
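A minimal sketch of this variant, assuming a diffusers-style VAE (`vae.decode(...).sample`, `vae.config.scaling_factor`) and PyTorch; the Huber path and the average-pool subsampling match the description above, while the function signature is illustrative:

```python
import torch.nn.functional as F

def pixel_space_sds_loss(latents, noise_pred, noise, w, vae, pixel_downsample=4):
    """SDS step computed in pixel space rather than latent space.

    latents:    current latents being optimized (gradients flow through these)
    noise_pred: UNet epsilon prediction on the noised latents (detached)
    noise:      the noise that was added at timestep t
    w:          SDS weighting w(t)
    """
    # Same update direction as latent-space SDS: w * (eps_hat - eps).
    grad = w * (noise_pred - noise)
    target_latents = (latents - grad).detach()

    # Differentiably decode both latents to RGB images in [0, 1].
    inv_scale = 1.0 / vae.config.scaling_factor
    cur_img = (vae.decode(latents * inv_scale).sample / 2 + 0.5).clamp(0, 1)
    tgt_img = (vae.decode(target_latents * inv_scale).sample / 2 + 0.5).clamp(0, 1)

    # Optionally average-pool the decoded images before the loss to cut compute.
    if pixel_downsample > 1:
        cur_img = F.avg_pool2d(cur_img, pixel_downsample)
        tgt_img = F.avg_pool2d(tgt_img, pixel_downsample)

    # Huber (smooth L1) loss between decoded current and target pixels.
    return F.smooth_l1_loss(cur_img, tgt_img)
```

Backpropagating this loss to the latents replaces the usual latent-space MSE against the detached target; the LPIPS path would instead map the decoded images to [-1, 1] before calling the LPIPS network.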
| Prompt | Iteration | Depth | RGB | Observation |
|---|---|---|---|---|
| a green frog on top of a horizontal flat rock | 8000 | ![]() | ![]() | The latent-space version of the SDS loss produces much cleaner depth renderings, yielding more solid geometry and finer detail in the RGB outputs. |
| a green frog on top of a horizontal flat rock | 8000 | ![]() | ![]() | The pixel-space version of the SDS loss is significantly slower due to its higher dimensionality, and even more so when LPIPS is used, since it requires an additional VGG network. To mitigate the slowdown, I only used a subsampled set of pixels for the loss computation, which leads to noisier depth maps and RGB renderings with more floating artifacts and a less solid appearance. |