Rendered output from the 3D Gaussian rasterizer:

The learning rates used for training the 3D Gaussian representation were:

pre_act_opacities: 0.05
pre_act_scales: 0.05
colours: 0.05
means: 0.005

The model was trained for 200 iterations, achieving the following metrics:


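As an aside, here is a minimal sketch of how such per-parameter learning rates can be wired into a PyTorch optimizer. The tensor shapes and attribute names are assumptions mirroring the list above, not the actual training code.

```python
import torch

# Hypothetical parameter tensors for N Gaussians; the names mirror the
# learning-rate list above, and the shapes are illustrative assumptions.
N = 1000
params = {
    "pre_act_opacities": torch.zeros(N, 1, requires_grad=True),
    "pre_act_scales":    torch.zeros(N, 3, requires_grad=True),
    "colours":           torch.zeros(N, 3, requires_grad=True),
    "means":             torch.zeros(N, 3, requires_grad=True),
}

# One optimizer parameter group per attribute, so each attribute
# gets its own learning rate.
optimizer = torch.optim.Adam([
    {"params": [params["pre_act_opacities"]], "lr": 0.05},
    {"params": [params["pre_act_scales"]],    "lr": 0.05},
    {"params": [params["colours"]],           "lr": 0.05},
    {"params": [params["means"]],             "lr": 0.005},
])
```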
Rendered output without Spherical Harmonics (same as Section 1.1):

Rendered output using Spherical Harmonics:

View 1:
Without Spherical Harmonics:

With Spherical Harmonics:
The chair's pattern appears more detailed, and the shadows and reflections on the metallic ornaments look more realistic.
View 2:
Without Spherical Harmonics:

With Spherical Harmonics:
Reflections are noticeably improved. Without Spherical Harmonics, metallic pieces display flat colors, and the fabric pattern lacks variation in response to shadows and light. Incorporating Spherical Harmonics introduces view-dependent effects, enhancing realism.
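To make "view-dependent" concrete, below is a minimal sketch of degree-1 spherical harmonic colour evaluation, following the convention used by common 3D Gaussian splatting implementations. The tensor shapes and function name are assumptions for illustration; real implementations typically use SH coefficients up to degree 3.

```python
import torch

C0 = 0.28209479177387814  # constant basis Y_0^0
C1 = 0.4886025119029199   # magnitude of the degree-1 bases Y_1^m

def sh_to_colour(sh_coeffs: torch.Tensor, dirs: torch.Tensor) -> torch.Tensor:
    """sh_coeffs: (N, 4, 3), one DC term plus three degree-1 terms per RGB
    channel; dirs: (N, 3), unit viewing directions from camera to Gaussian."""
    x, y, z = dirs[:, 0:1], dirs[:, 1:2], dirs[:, 2:3]
    colour = (
        C0 * sh_coeffs[:, 0]
        - C1 * y * sh_coeffs[:, 1]
        + C1 * z * sh_coeffs[:, 2]
        - C1 * x * sh_coeffs[:, 3]
    )
    # Shift so the DC term is centred around 0.5, then clamp to valid RGB.
    return (colour + 0.5).clamp(0.0, 1.0)
```

Because the degree-1 terms depend on the viewing direction, the same Gaussian can take on different colours from different cameras, which is what produces the improved reflections described above.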
Prompt: "a hamburger"
Left: Without SDS guidance. Right: With SDS guidance.
Iterations: 400 and 1000, respectively.

Prompt: "a standing corgi dog"
Left: Without SDS guidance. Right: With SDS guidance.
Iterations: 1000 and 900, respectively.

Prompt: "a soccer ball"
Left: Without SDS guidance. Right: With SDS guidance.
Iterations: 1500 and 1100, respectively.

Prompt: "a rubik cube"
Left: Without SDS guidance. Right: With SDS guidance.
Iterations: 1400 and 1999, respectively.
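For reference, the SDS objective underlying these comparisons can be sketched as follows. This is the standard DreamFusion-style formulation with classifier-free guidance, written against a diffusers-style UNet and scheduler; the argument names and the guidance scale are assumptions, not the report's exact code.

```python
import torch
import torch.nn.functional as F

def sds_loss(latents, unet, scheduler, text_emb, uncond_emb, cfg_scale=100.0):
    # Sample a random diffusion timestep and noise the latents.
    t = torch.randint(20, 980, (latents.shape[0],), device=latents.device)
    noise = torch.randn_like(latents)
    noisy_latents = scheduler.add_noise(latents, noise, t)

    with torch.no_grad():
        # Classifier-free guidance: blend conditional and unconditional scores.
        eps_cond = unet(noisy_latents, t, encoder_hidden_states=text_emb).sample
        eps_uncond = unet(noisy_latents, t, encoder_hidden_states=uncond_emb).sample
        eps = eps_uncond + cfg_scale * (eps_cond - eps_uncond)

    # The SDS gradient is (eps - noise); folding it into an MSE against a
    # detached target reproduces that gradient through autograd.
    target = (latents - (eps - noise)).detach()
    return 0.5 * F.mse_loss(latents, target, reduction="sum")
```

If "without guidance" in the captions refers to running SDS without classifier-free guidance, it corresponds to cfg_scale = 1.0 (the unconditional pass can then be skipped).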

Prompt: "a hamburger"

Prompt: "a rubik cube"

Prompt: "a standing corgi dog"

Prompt: "a hotdog"

Prompt: "a gamer chair"

Prompt: "a standing corgi dog"

Prompt: "a hotdog"

The use of view-dependent text embedding significantly improves the results. In the corgi example, without view-dependent text embedding, the dog appears to have three ears, and its face remains visible from nearly every angle. Incorporating view-dependency makes the geometry more realistic and consistent with the expected appearance.
A similar improvement is observed in the hotdog example. Without view-dependency, the result lacks asymmetry, which is a natural characteristic of a hotdog. Adding view-dependency introduces the expected asymmetry, leading to a more realistic and visually accurate representation.
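View-dependent text embedding is typically implemented by augmenting the prompt with the camera's viewpoint before encoding it, as in DreamFusion. A minimal sketch, where the azimuth thresholds are assumed values:

```python
def view_dependent_prompt(base_prompt: str, azimuth_deg: float) -> str:
    """Append a view suffix to the prompt based on camera azimuth.
    Thresholds are illustrative; implementations vary."""
    azimuth = azimuth_deg % 360.0
    if azimuth < 45.0 or azimuth >= 315.0:
        suffix = "front view"
    elif azimuth < 135.0:
        suffix = "side view"
    elif azimuth < 225.0:
        suffix = "back view"
    else:
        suffix = "side view"
    return f"{base_prompt}, {suffix}"

# e.g. view_dependent_prompt("a standing corgi dog", 180.0)
# -> "a standing corgi dog, back view"
```

Telling the diffusion model which side it is generating discourages it from painting a face onto every view, which is why the three-eared, always-frontal corgi disappears.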
For this implementation, the code from Section 2.3 was extended with a new loss function, sds_loss_pixel, which computes the loss directly between the predicted and target images rather than between latents. Working in pixel space makes it possible to combine LPIPS loss with MSE loss, so the objective captures both perceptual similarity and pixel-wise accuracy. A sketch of this combination follows.
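The core loss combination inside such a function might look like the sketch below, using the lpips package; the weighting factor, function name, and tensor conventions are assumptions, not the report's actual values.

```python
import torch
import torch.nn.functional as F
import lpips  # perceptual similarity metric: pip install lpips

lpips_fn = lpips.LPIPS(net="vgg")  # LPIPS with a VGG backbone

def sds_loss_pixel_sketch(pred_rgb: torch.Tensor, target_rgb: torch.Tensor,
                          lpips_weight: float = 0.1) -> torch.Tensor:
    """pred_rgb: rendered image; target_rgb: decoded one-step denoised image,
    detached so gradients only flow through the render. Both are (B, 3, H, W)
    in [-1, 1], the range LPIPS expects."""
    mse = F.mse_loss(pred_rgb, target_rgb)
    perceptual = lpips_fn(pred_rgb, target_rgb).mean()
    return mse + lpips_weight * perceptual
```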
The results are worse than those in Section 2.3, especially for depth estimation. This could be due to the decoder not reconstructing the image faithfully, leading to noisy gradients and suboptimal loss values.