Assignment 4: 3D Gaussian Splatting and Diffusion Guided Optimization
Name: Simson D'Souza, Andrew ID: sjdsouza, Email: sjdsouza@andrew.cmu.edu
1. 3D Gaussian Splatting
1.1 3D Gaussian Rasterization (35 points)
1.1.5 Perform Splatting
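The renders for this part come from compositing depth-sorted 2D Gaussians front to back. Below is a minimal sketch of that compositing step, not the exact assignment code; the function name and tensor layout are assumptions made for illustration.

```python
import torch

def composite_front_to_back(colors, alphas):
    """Illustrative splatting step: alpha-composite N depth-sorted Gaussians.

    colors: (N, P, 3) per-Gaussian RGB contribution at each of P pixels
    alphas: (N, P, 1) per-Gaussian opacity times 2D Gaussian density, in [0, 1]
    """
    # Transmittance T_i = prod_{j<i} (1 - alpha_j): an exclusive cumulative
    # product along the depth-sorted Gaussian axis.
    T = torch.cumprod(1.0 - alphas, dim=0)
    T = torch.cat([torch.ones_like(T[:1]), T[:-1]], dim=0)

    # Each Gaussian contributes T_i * alpha_i * c_i; sum over Gaussians.
    weights = T * alphas                  # (N, P, 1)
    return (weights * colors).sum(dim=0)  # (P, 3) composited image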
1.2 Training 3D Gaussian Representations (15 points)
Learning Rates
- Opacities: 0.001
- Scales: 0.001
- Colours: 0.02
- Means: 0.0002
Number of Iterations: 1000
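These per-parameter learning rates can be wired into a single Adam optimizer using parameter groups; the sketch below assumes the Gaussian parameters live in tensors named `opacities`, `scales`, `colours`, and `means` (names and sizes are illustrative).

```python
import torch

# Illustrative parameter tensors for N Gaussians.
N = 10_000
opacities = torch.rand(N, 1, requires_grad=True)
scales    = torch.rand(N, 3, requires_grad=True)
colours   = torch.rand(N, 3, requires_grad=True)
means     = torch.randn(N, 3, requires_grad=True)

# One Adam optimizer, one learning rate per parameter group.
optimizer = torch.optim.Adam([
    {"params": [opacities], "lr": 1e-3},
    {"params": [scales],    "lr": 1e-3},
    {"params": [colours],   "lr": 2e-2},
    {"params": [means],     "lr": 2e-4},
])
```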
Evaluation Metrics
- PSNR: 28.336
- SSIM: 0.93
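For reference, a minimal PSNR computation for images scaled to [0, 1] (SSIM is typically taken from a library such as scikit-image rather than hand-rolled):

```python
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio in dB for images in [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```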
1.3 Extensions (Choose at least one! More than one is extra credit)
1.3.1 Rendering Using Spherical Harmonics (10 Points)
With Spherical Harmonics
Without Spherical Harmonics (Render from Q1.1.5)
Comparison
| With Spherical Harmonics | Without Spherical Harmonics |
|---|---|
| Frame 5 *(image)* | Frame 5 *(image)* |
| Frame 14 *(image)* | Frame 14 *(image)* |
| Frame 29 *(image)* | Frame 29 *(image)* |
Observations
The following observations apply to all the frame comparisons shown above:
- The renders with Spherical Harmonics exhibit more realistic lighting and shading, as the colors dynamically vary with the viewing direction.
- The renders without Spherical Harmonics appear flat and uniformly lit, lacking directional color variation.
- Spherical Harmonics help capture view-dependent effects such as specular highlights and subtle reflections, improving overall visual fidelity.
- Additionally, fine structural details of the object are more clearly visible in the renders with Spherical Harmonics, enhancing the overall reconstruction quality. A sketch of how SH coefficients are evaluated into view-dependent color follows below.
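The sketch below shows how a DC term plus three degree-1 SH coefficients can be evaluated into a view-dependent color, following the sign and scaling convention used in the reference 3DGS implementation; the tensor shapes and function name are assumptions.

```python
import torch

C0 = 0.28209479177387814   # degree-0 (DC) SH constant
C1 = 0.48860251190291992   # degree-1 SH constant

def sh_to_color(sh, dirs):
    """Evaluate degree-1 spherical harmonics into RGB.

    sh:   (N, 4, 3) coefficients per Gaussian: 1 DC band + 3 degree-1 bands
    dirs: (N, 3)    unit viewing directions (camera centre to Gaussian mean)
    """
    x, y, z = dirs[:, 0:1], dirs[:, 1:2], dirs[:, 2:3]
    rgb = (C0 * sh[:, 0]
           - C1 * y * sh[:, 1]
           + C1 * z * sh[:, 2]
           - C1 * x * sh[:, 3])
    # Offset so zero coefficients map to mid-grey, then clamp to valid range.
    return (rgb + 0.5).clamp(0.0, 1.0)
```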
2. Diffusion-guided Optimization
2.1 SDS Loss + Image Optimization (20 points)
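SDS treats a pretrained diffusion model as a frozen critic: noise is added to the current latents, the model predicts that noise, and the residual is pushed back into the latents as a gradient. Below is a minimal sketch assuming a latent-diffusion guide; `unet` is a hypothetical callable returning the noise prediction, and `alphas_cumprod` stands for the scheduler's cumulative alpha table.

```python
import torch

def sds_loss(latents, text_emb, uncond_emb, unet, alphas_cumprod,
             guidance_scale=100.0):
    """Illustrative SDS step on (B, 4, 64, 64) latents being optimized."""
    B = latents.shape[0]
    t = torch.randint(20, 980, (B,), device=latents.device)   # random timestep
    noise = torch.randn_like(latents)
    a_t = alphas_cumprod[t].view(B, 1, 1, 1)
    noisy = a_t.sqrt() * latents + (1.0 - a_t).sqrt() * noise  # forward diffusion

    with torch.no_grad():  # no gradient flows through the U-Net itself
        eps_uncond = unet(noisy, t, uncond_emb)
        eps_text = unet(noisy, t, text_emb)
    # Classifier-free guidance: extrapolate towards the text-conditioned score.
    eps = eps_uncond + guidance_scale * (eps_text - eps_uncond)

    w = 1.0 - a_t               # common timestep weighting
    grad = w * (eps - noise)    # SDS gradient (U-Net Jacobian omitted)
    # Surrogate loss whose gradient w.r.t. the latents equals `grad`.
    return (grad.detach() * latents).sum()
```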
The following are the images produced with and without guidance for the image optimization task.
Prompt: "a hamburger"
| Without Guidance (2000 iterations) | With Guidance (2000 iterations) |
|---|---|
| *(image)* | *(image)* |
Prompt: "a standing corgi dog"
| Without Guidance (2000 iterations) | With Guidance (2000 iterations) |
|---|---|
| *(image)* | *(image)* |
Prompt: "a unicorn riding a skateboard"
| Without Guidance (2000 iterations) | With Guidance (2000 iterations) |
|---|---|
| *(image)* | *(image)* |
Prompt: "a penguin in sunglasses"
| Without Guidance (2000 iterations) | With Guidance (2000 iterations) |
|---|---|
| *(image)* | *(image)* |
2.2 Texture Map Optimization for Mesh (15 points)
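Here the mesh geometry stays fixed and only a UV texture image is optimized: each step renders the mesh from a random camera with a differentiable renderer and applies the SDS loss from Section 2.1 to the rendering. The sketch below injects the renderer and loss as callables; all names and the texture resolution are illustrative.

```python
import torch

def optimize_texture(mesh, render, sds_loss, sample_camera,
                     n_steps=2000, lr=0.01):
    """Illustrative loop: only the texture map receives gradients."""
    texture = torch.rand(1, 512, 512, 3, requires_grad=True)  # UV texture map
    optimizer = torch.optim.Adam([texture], lr=lr)
    for _ in range(n_steps):
        rgb = render(mesh, texture, sample_camera())  # differentiable rendering
        loss = sds_loss(rgb)                          # SDS as in Section 2.1
        optimizer.zero_grad()
        loss.backward()                               # grads reach only `texture`
        optimizer.step()
    return texture
```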
The following are the results of texture map optimization for a mesh.
Prompt: "a black and white cow"
Iterations: 2000
| Initial Mesh | Final Mesh |
|---|---|
| *(image)* | *(image)* |
Prompt: "an orange bull with spots"
Iterations: 2000
| Initial Mesh | Final Mesh |
| *(image)* | *(image)* |
|
Prompt: "a rainbow-colored cow"
Iterations: 2000
| Initial Mesh | Final Mesh |
| *(image)* | *(image)* |
|
2.3 NeRF Optimization (15 points)
Here, the 3D representation is generalized from a mesh to a NeRF model, where both the geometry and the color are learnable.
Hyperparameters
- Regularization terms, entropy regularization (`lambda_entropy`) and orientation regularization (`lambda_orient`), were tuned to stabilize NeRF optimization and improve geometry quality (see the sketch after this list).
  - `lambda_entropy`: 1e-3
  - `lambda_orient`: 1e-2
- The shading parameter (`latent_iter_ratio`) was tuned to warm up training with normal shading at the beginning and gradually switch to random shading, helping the model learn better geometry and albedo.
  - `latent_iter_ratio`: 0.2
Epochs: 100
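For reference, minimal sketches of the two regularizers, assuming per-ray volume-rendering weights and predicted normals are available; the shapes, names, and exact weighting are assumptions.

```python
import torch

def entropy_reg(weights, eps=1e-6):
    """Binary entropy on blending weights; pushes opacities towards 0 or 1,
    which discourages semi-transparent 'fog' and sharpens geometry."""
    w = weights.clamp(eps, 1.0 - eps)
    return (-(w * w.log() + (1.0 - w) * (1.0 - w).log())).mean()

def orient_reg(weights, normals, ray_dirs):
    """Penalize normals that face away from the camera: `ray_dirs` point from
    the camera into the scene, so a positive dot product means back-facing."""
    n_dot_d = (normals * ray_dirs).sum(dim=-1)
    return (weights.detach() * n_dot_d.clamp(min=0.0) ** 2).mean()

# Illustrative total: loss = sds + 1e-3 * entropy_reg(w) + 1e-2 * orient_reg(w, n, d)
```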
The following are the results of view-independent NeRF optimization.
Prompt: "a standing corgi dog"
| RGB | Depth |
|---|---|
| *(image)* | *(image)* |
Prompt: "a yellow rubber duck"
| RGB | Depth |
|---|---|
| *(image)* | *(image)* |
Prompt: "a teddy bear"
| RGB | Depth |
|---|---|
| *(image)* | *(image)* |
Observations
- The generated shapes are overall quite accurate, and the geometry of each object is well captured.
- However, several artifacts appear because the guidance is not view-aware: the same text conditioning is applied from every camera angle.
- For example, the prompt “a standing corgi dog” produces a dog with three ears.
- The "duck" appears with two beaks, and the teddy bear seems to have multiple faces.
- These artifacts suggest that the model struggles to maintain 3D consistency.
2.4 Extensions (Choose at least one! More than one is extra credit)
2.4.1 View-dependent text embedding (10 points)
The hyperparameters are the same as in Q2.3. The following are the results of view-dependent NeRF optimization.
Prompt: "a standing corgi dog"
| RGB (Q2.3) | Depth (Q2.3) | RGB (Q2.4.1) | Depth (Q2.4.1) |
|---|---|---|---|
| *(image)* | *(image)* | *(image)* | *(image)* |
Prompt: "a yellow rubber duck"
| RGB (Q2.3) | Depth (Q2.3) | RGB (Q2.4.1) | Depth (Q2.4.1) |
|---|---|---|---|
| *(image)* | *(image)* | *(image)* | *(image)* |
Prompt: "a teddy bear"
| RGB (Q2.3) | Depth (Q2.3) | RGB (Q2.4.1) | Depth (Q2.4.1) |
|---|---|---|---|
| *(image)* | *(image)* | *(image)* | *(image)* |
Observations
- Compared to Q2.3, where the model used view-independent text conditioning, the view-dependent text conditioning produces significantly more consistent and coherent shapes across multiple viewpoints. In Q2.3, some objects exhibited artifacts such as duplicated limbs or facial features (e.g., the corgi having three ears or the teddy bear showing multiple faces). These artifacts occur because the text prompt provided the same conditioning for all views, making it hard for the model to maintain 3D consistency when synthesizing images from different camera angles.
- With view-dependent text conditioning, separate embeddings are used for the front, side, and back views. During rendering, the model interpolates between these embeddings based on the camera azimuth angle, adapting the text representation to each viewpoint. This leads to more consistent geometry across views: front-to-side transitions look smoother, and the model avoids artifacts such as the duplicated ears and faces seen in Q2.3. Overall, the objects appear more realistic, with better alignment between the RGB and depth outputs. A sketch of the azimuth-based interpolation follows below.
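A minimal sketch of that interpolation, assuming three precomputed embeddings obtained by appending ", front view", ", side view", and ", back view" to the prompt; the angle convention (0° = front, 180° = back) and all names are assumptions.

```python
import torch

def view_dependent_embedding(azimuth_deg, emb_front, emb_side, emb_back):
    """Blend per-view text embeddings from the camera azimuth angle."""
    az = azimuth_deg % 360.0
    if az > 180.0:                  # mirror: e.g. 270 deg behaves like 90 deg
        az = 360.0 - az
    if az <= 90.0:                  # front -> side
        r = az / 90.0
        return (1.0 - r) * emb_front + r * emb_side
    r = (az - 90.0) / 90.0          # side -> back
    return (1.0 - r) * emb_side + r * emb_back
```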