Question 1: 3D Gaussian Splatting¶
Note: Due to memory and compute constraints, I set gaussians_per_splat = 64.
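As a rough illustration of what this setting controls (a sketch of my own; splat_in_chunks and compose_chunk are hypothetical names, not the assignment's code), the splatting loop can process the depth-sorted Gaussians in chunks of 64 and composite each chunk onto the running image, which bounds peak memory:

```python
# Illustrative sketch only: chunked splatting with a fixed chunk size.
gaussians_per_splat = 64

def splat_in_chunks(sorted_gaussians, image, compose_chunk):
    # `compose_chunk` is assumed to alpha-composite one chunk of (depth-sorted)
    # Gaussians onto the running image, front to back.
    for start in range(0, len(sorted_gaussians), gaussians_per_splat):
        chunk = sorted_gaussians[start:start + gaussians_per_splat]
        image = compose_chunk(image, chunk)
    return image
```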
1.1.5: Perform Splatting¶

1.2.2: Perform Forward Pass and Compute Loss¶
I experimented with 4 different variations of the parameter settings. Each model was trained for 2000 iterations.
Parameter settings 1 (original settings):
opacities: 0.05
scales: 0.05
colours: 0.05
means: 0.05

Mean PSNR: 26.068
Mean SSIM: 0.885
Parameter settings 2 (modified means only):
opacities: 0.05
scales: 0.05
colours: 0.05
means: 0.001

Mean PSNR: 29.131
Mean SSIM: 0.934
Parameter settings 3 (modified means and colours only):
opacities: 0.05
scales: 0.05
colours: 0.005
means: 0.001

Mean PSNR: 29.520
Mean SSIM: 0.940
Parameter settings 4 (modified all):
opacities: 0.005
scales: 0.01
colours: 0.005
means: 0.001

Mean PSNR: 29.585
Mean SSIM: 0.939
The best performing models, visually and according to the PSNR and SSIM scores, were:
Highest SSIM: Parameter settings 3
Highest PSNR: Parameter settings 4
Best overall visually: Parameter settings 4
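Assuming the four numbers in each setting are per-parameter learning rates (that is how I have treated them above), here is a minimal sketch of how settings 4 could be passed to Adam as parameter groups; the tensors below are placeholders standing in for the actual Gaussian parameters:

```python
import torch

# Minimal sketch, assuming the values above are per-parameter-group learning
# rates (parameter settings 4). Shapes and initial values are placeholders.
N = 10000
means     = torch.zeros(N, 3, requires_grad=True)
scales    = torch.zeros(N, 3, requires_grad=True)
colours   = torch.zeros(N, 3, requires_grad=True)
opacities = torch.zeros(N, 1, requires_grad=True)

optimizer = torch.optim.Adam([
    {"params": [means],     "lr": 0.001},
    {"params": [scales],    "lr": 0.01},
    {"params": [colours],   "lr": 0.005},
    {"params": [opacities], "lr": 0.005},
])
```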
1.3.1: Rendering Using Spherical Harmonics¶
GIF for 1.3.1:

GIF for 1.1.5 (old):

3 side-by-side comparisons
(First/Left = without spherical harmonics, Second/Right = with spherical harmonics)
Frame 3:

Frame 11:

Frame 16:

Explanation of differences:
Spherical harmonics are introduced to model view-dependent appearance: the color of each 3D Gaussian changes with the viewing direction, which accounts for effects such as lighting, shadows, reflections, and shading. Overall, this should produce more realistic, higher-quality renders.
We can clearly see that images rendered with spherical harmonics have more pronounced shadows (frames 3 and 11), and when the back of the chair is visible, the two parallel streaks are a much darker color. However, in these renders the streaks are too pronounced: they should be a lighter shade and blend more smoothly when a shadow falls on the back of the chair.
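As a rough sketch of the mechanism (my own reconstruction, not the assignment's exact code), a degree-1 spherical-harmonic colour is a view-independent DC term plus three band-1 coefficients weighted by the components of the viewing direction:

```python
import torch

# Sketch: evaluate degree-1 spherical harmonics to get a view-dependent colour.
SH_C0 = 0.28209479177387814  # band-0 (DC) constant
SH_C1 = 0.4886025119029199   # band-1 constant

def sh_to_colour(sh_coeffs: torch.Tensor, view_dirs: torch.Tensor) -> torch.Tensor:
    # sh_coeffs: (N, 4, 3) - one DC + three band-1 RGB coefficients per Gaussian.
    # view_dirs: (N, 3)    - unit vectors from the camera to each Gaussian centre.
    x, y, z = view_dirs[:, 0:1], view_dirs[:, 1:2], view_dirs[:, 2:3]
    colour = SH_C0 * sh_coeffs[:, 0]               # view-independent term
    colour = colour - SH_C1 * y * sh_coeffs[:, 1]  # band-1 terms vary with the
    colour = colour + SH_C1 * z * sh_coeffs[:, 2]  #   viewing direction, giving
    colour = colour - SH_C1 * x * sh_coeffs[:, 3]  #   view-dependent appearance
    return torch.clamp(colour + 0.5, 0.0, 1.0)     # shift and clamp to [0, 1]
```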
Question 2: Diffusion-guided Optimization¶
2.1: SDS Loss + Image Optimization¶
Results for 4 different image-prompt pairs. For each prompt, the first image is optimized with guidance and the second without guidance.
Prompt 1: "a hamburger"
Number of Training Iterations: 700

Prompt 2: "a standing corgi dog"
Number of Training Iterations: 500

Prompt 3: "a monkey playing the guitar"
Number of Training Iterations: 500

Prompt 4: "green and grey polka dotted pattern"
Number of Training Iterations: 400

2.2: Texture Map Optimization for Mesh¶
Prompt 1: "yellow and red plaid pattern"

Prompt 2: "green and grey polka dot pattern"

2.3: NeRF Optimization¶
Depth image on the left, RGB image on the right.
Prompt 1: "a standing corgi dog"
Prompt 2: "a dolphin"
Prompt 3: "a pink bicycle"
2.4.1: View-dependent text embedding¶
Depth image on the left, RGB image on the right.
Prompt 1: "a standing corgi dog", trained for 40 iterations due to compute restrictions
Prompt 2: "a dolphin", trained for 60 iterations
Comparison with Q2.3:
The authors implement view-dependent text conditioning by evaluating the azimuth angle of the camera and identifying whether the current view best corresponds to the 'front', 'back', or 'side' of the object.
In theory, view-dependent conditioning should improve the performance of the model, since it encourages more structurally consistent and realistic viewpoints. Because the model explicitly conditions on both the prompt and the viewpoint, the NeRF should learn a 3D representation that is consistent across different viewing angles.
Visually, the RGB GIF of the dolphin is more accurate with view-dependent conditioning (it captures the face and fins better), but we still see the Janus or 'multi-head' problem. Due to significant resource constraints, the standing corgi dog looked worse with view-dependent conditioning: although it trained for the same number of iterations, it had a much splotchier 3D structure. This is probably because the model has to learn a more complex representation (one with both geometric structure and view conditioning) in the same duration.
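For concreteness, a minimal sketch of the idea (the angle thresholds here are placeholders, not the paper's exact values): the azimuth picks a view word that is appended to the prompt before text encoding.

```python
# Sketch of view-dependent prompting: map the camera azimuth (degrees) to a
# view suffix and append it to the base prompt. Thresholds are illustrative.
def view_dependent_prompt(base_prompt: str, azimuth_deg: float) -> str:
    azimuth_deg = azimuth_deg % 360.0
    if azimuth_deg < 45.0 or azimuth_deg >= 315.0:
        view = "front view"
    elif 135.0 <= azimuth_deg < 225.0:
        view = "back view"
    else:
        view = "side view"
    return f"{base_prompt}, {view}"

# e.g. view_dependent_prompt("a standing corgi dog", 180.0)
# -> "a standing corgi dog, back view"
```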
2.4.3: Variation of implementation of SDS loss¶
The difference in this SDS loss implementation is that the loss is computed in pixel space instead of latent space.
The Stable Diffusion UNet still operates in latent space, so although the sds_loss function is passed the raw predicted image (H x W x 3), the image still has to be encoded and fed through the UNet to obtain the noise prediction.
The diffusion model's prediction is then decoded back into pixel space. Instead of an L2 loss, I experimented with LPIPS (from a Python library), computed on the H x W x 3 images.
Because the pixel-space dimensionality is much larger, convergence and training are slower and less stable than with SDS in latent space.
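A rough sketch of this variant under my own assumptions about the surrounding objects (vae, unet, scheduler, and text_embeddings come from a diffusers Stable Diffusion pipeline; pred_rgb is the rendered image as a (1, 3, H, W) tensor in [0, 1]; the timestep range and helper names are placeholders):

```python
import torch
import lpips  # perceptual loss library (pip install lpips)

lpips_fn = lpips.LPIPS(net="vgg")  # move to the same device as pred_rgb before use

def pixel_space_sds_loss(pred_rgb, vae, unet, scheduler, text_embeddings):
    # Encode the rendered image into Stable Diffusion's latent space.
    latents = vae.encode(pred_rgb * 2.0 - 1.0).latent_dist.sample() * 0.18215

    # Add noise at a random timestep and predict it with the UNet.
    t = torch.randint(20, 980, (1,), device=latents.device)
    noise = torch.randn_like(latents)
    noisy_latents = scheduler.add_noise(latents, noise, t)
    noise_pred = unet(noisy_latents, t, encoder_hidden_states=text_embeddings).sample

    # Recover the UNet's estimate of the clean latents and decode to pixels.
    alpha_bar = scheduler.alphas_cumprod.to(latents.device)[t].view(-1, 1, 1, 1)
    latents_hat = (noisy_latents - (1.0 - alpha_bar).sqrt() * noise_pred) / alpha_bar.sqrt()
    with torch.no_grad():
        target_rgb = (vae.decode(latents_hat / 0.18215).sample * 0.5 + 0.5).clamp(0, 1)

    # Perceptual (LPIPS) distance between the render and the decoded prediction,
    # computed in pixel space rather than on the latents.
    return lpips_fn(pred_rgb * 2.0 - 1.0, target_rgb * 2.0 - 1.0).mean()
```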
Prompt: "a dolphin"
GIF: pixel-space SDS loss. In the same amount of training time, the structure is much less defined.