Assignment 4: 3D Gaussian Splatting and Diffusion Guided Optimization¶

Name: Simson D'Souza, Andrew ID: sjdsouza, Email: sjdsouza@andrew.cmu.edu¶


1. 3D Gaussian Splatting¶

1.1 3D Gaussian Rasterization (35 points)¶

1.1.5 Perform Splatting¶

Rendered Scene using Gaussian Splatting
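
For reference, the core of the splatting step can be sketched as a vectorized alpha-compositing pass. This is a minimal sketch assuming the Gaussians have already been projected to 2D and sorted front-to-back; the tensor names (means2d, inv_cov2d, etc.) are illustrative and not the actual starter-code interface.

    import torch

    def composite_gaussians(pixels, means2d, inv_cov2d, opacities, colors):
        # pixels: (P, 2) pixel centers; means2d: (N, 2) projected Gaussian centers
        # inv_cov2d: (N, 2, 2) inverse 2D covariances; opacities: (N,); colors: (N, 3)
        # Gaussians are assumed sorted front-to-back by camera-space depth.
        d = pixels[:, None, :] - means2d[None, :, :]              # (P, N, 2)
        maha = torch.einsum('pni,nij,pnj->pn', d, inv_cov2d, d)   # squared Mahalanobis distance
        alpha = opacities[None, :] * torch.exp(-0.5 * maha)       # per-pixel, per-Gaussian alpha
        # Exclusive transmittance: product of (1 - alpha) over all nearer Gaussians
        T = torch.cumprod(1.0 - alpha + 1e-10, dim=1)
        T = torch.cat([torch.ones_like(T[:, :1]), T[:, :-1]], dim=1)
        weights = T * alpha                                       # compositing weights
        return (weights[..., None] * colors[None, :, :]).sum(dim=1)  # (P, 3) RGB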

1.2 Training 3D Gaussian Representations (15 points)¶

  • Learning Rates

    • Opacities: 0.001
    • Scales: 0.001
    • Colors: 0.02
    • Means: 0.0002
  • Number of Iterations: 1000

  • Evaluation Metrics

    • PSNR: 28.336
    • SSIM: 0.93
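
The configuration above can be wired together with per-group learning rates in a single optimizer. A minimal sketch, assuming hypothetical gaussians.* parameter attributes and a render(gaussians, camera) helper rather than the exact starter-code names:

    import torch

    # One Adam optimizer with a separate learning rate per parameter group
    optimizer = torch.optim.Adam([
        {'params': [gaussians.opacities], 'lr': 1e-3},
        {'params': [gaussians.scales],    'lr': 1e-3},
        {'params': [gaussians.colors],    'lr': 2e-2},
        {'params': [gaussians.means],     'lr': 2e-4},
    ])

    for step in range(1000):
        pred = render(gaussians, camera)                    # forward splatting pass
        loss = torch.nn.functional.l1_loss(pred, gt_image)  # photometric loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # PSNR for images in [0, 1]; SSIM would come from a library such as torchmetrics
    psnr = 10 * torch.log10(1.0 / torch.mean((pred - gt_image) ** 2))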
Training Progress




Final Renders

1.3 Extensions (Choose at least one! More than one is extra credit)¶

1.3.1 Rendering Using Spherical Harmonics (10 Points)¶

With Spherical Harmonics





Without Spherical Harmonics (Render from Q1.1.5)





Comparison

With Spherical Harmonics (left) vs. Without Spherical Harmonics (right), shown for Frames 5, 14, and 29




Observations

The following observations apply to all of the frame comparisons shown above.

  • The renders with Spherical Harmonics exhibit more realistic lighting and shading, as the colors dynamically vary with the viewing direction.
  • The renders without Spherical Harmonics appear flat and uniformly lit, lacking directional color variation.
  • Spherical Harmonics help capture view-dependent effects such as specular highlights and subtle reflections, improving overall visual fidelity.
  • Additionally, fine structural details of the object are more clearly visible in the renders with Spherical Harmonics, enhancing the quality of the reconstruction.
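
A minimal sketch of how the view-dependent color could be evaluated from degree-1 spherical harmonics; the coefficient layout and function name are assumptions, not the exact assignment interface.

    import torch

    # Real SH basis constants for bands 0 and 1
    C0 = 0.28209479177387814
    C1 = 0.4886025119029199

    def sh_to_color(sh, dirs):
        # sh:   (N, 4, 3) coefficients (1 DC term + 3 band-1 terms, per RGB channel)
        # dirs: (N, 3) unit viewing directions from the camera to each Gaussian
        x, y, z = dirs[:, 0:1], dirs[:, 1:2], dirs[:, 2:3]
        color = (C0 * sh[:, 0]
                 - C1 * y * sh[:, 1]
                 + C1 * z * sh[:, 2]
                 - C1 * x * sh[:, 3])
        # Shift the DC-centered output into [0, 1], as in the common 3DGS convention
        return torch.clamp(color + 0.5, 0.0, 1.0)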

2. Diffusion-guided Optimization¶

2.1 SDS Loss + Image Optimization (20 points)¶

The following images show the results of image optimization with and without guidance.
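
Before the results, here is a rough sketch of a single SDS step with classifier-free guidance, written against the diffusers-style unet/scheduler interfaces; the timestep range and guidance scale are assumptions. The "without guidance" runs correspond to skipping the guided combination and using the conditional prediction eps_cond directly.

    import torch

    def sds_loss(latents, text_emb, uncond_emb, unet, scheduler, guidance_scale=100.0):
        # Sample a random timestep and noise the latents
        t = torch.randint(20, 980, (latents.shape[0],), device=latents.device)
        noise = torch.randn_like(latents)
        noisy = scheduler.add_noise(latents, noise, t)

        with torch.no_grad():
            # Classifier-free guidance: one conditional and one unconditional pass
            eps_cond = unet(noisy, t, encoder_hidden_states=text_emb).sample
            eps_uncond = unet(noisy, t, encoder_hidden_states=uncond_emb).sample
            eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
            # "Without guidance" would use eps_cond directly here instead

        # SDS: the gradient w.r.t. latents is (eps - noise); realize it as a dot product
        grad = torch.nan_to_num(eps - noise)
        return (grad * latents).sum()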

Prompt: "a hamburger"

Without Guidance (2000 iterations) | With Guidance (2000 iterations)




Prompt: "a standing corgi dog"

Without Guidance (2000 iterations) | With Guidance (2000 iterations)




Prompt: "a unicorn riding a skateboard"

Without Guidance (2000 iterations) | With Guidance (2000 iterations)




Prompt: "a penguin in sunglasses"

Without Guidance (2000 iterations) | With Guidance (2000 iterations)





2.2 Texture Map Optimization for Mesh (15 points)¶

The following are the results of texture map optimization for a mesh.
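
The pipeline here is the same SDS loop as Q2.1, but the only learnable parameters belong to the texture. A minimal sketch, where sample_random_camera, render_mesh, and encode_to_latents are hypothetical placeholders for the assignment's camera sampler, differentiable mesh renderer, and Stable Diffusion VAE encoder:

    import torch

    # Learnable UV texture image; the mesh geometry itself stays fixed
    texture = torch.nn.Parameter(torch.rand(1, 3, 512, 512, device='cuda'))
    optimizer = torch.optim.Adam([texture], lr=1e-2)

    for step in range(2000):
        camera = sample_random_camera()              # new random viewpoint each step
        image = render_mesh(mesh, texture, camera)   # differentiable mesh render
        latents = encode_to_latents(image)           # encode the render into SD latent space
        loss = sds_loss(latents, text_emb, uncond_emb, unet, scheduler)
        optimizer.zero_grad()
        loss.backward()                              # only the texture receives gradients
        optimizer.step()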

Prompt: "a black and white cow"

Iterations: 2000

Initial Mesh | Final Mesh




Prompt: "an orange bull with spots"

Iterations: 2000

Initial Mesh | Final Mesh




Prompt: "a rainbow-colored cow"

Iterations: 2000

Initial Mesh | Final Mesh





2.3 NeRF Optimization (15 points)¶

Here, the 3D representation is generalized from a mesh to a NeRF, where both geometry and color are learnable.

Hyperparameters

  • Regularization terms such as entropy regularization (lambda_entropy) and orientation regularization (lambda_orient) were tuned to stabilize NeRF optimization and improve geometry quality; a sketch of both terms follows this list.
    • lambda_entropy: 1e-3
    • lambda_orient: 1e-2
  • Shading parameter (latent_iter_ratio) was tuned to warm up training with normal shading at the beginning and gradually switch to random shading, helping the model learn better geometry and albedo.
    • latent_iter_ratio: 0.2
  • Epochs: 100
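
A minimal sketch of what the two regularizers above might compute, following the common DreamFusion-style formulation; the function names and exact weighting are assumptions:

    import torch

    def entropy_reg(alphas, eps=1e-5):
        # Binary entropy on per-sample opacities: pushes alphas toward 0 or 1,
        # which discourages semi-transparent "fog" floating in empty space
        a = alphas.clamp(eps, 1.0 - eps)
        return -(a * torch.log(a) + (1 - a) * torch.log(1 - a)).mean()

    def orient_reg(normals, view_dirs, weights):
        # Penalize back-facing normals (n . v > 0 when a normal points away
        # from the camera), weighted by each sample's rendering weight
        n_dot_v = (normals * view_dirs).sum(dim=-1)
        return (weights * torch.clamp(n_dot_v, min=0.0) ** 2).mean()

    # total_loss = sds + 1e-3 * entropy_reg(...) + 1e-2 * orient_reg(...)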

The following are the results of view-independent NeRF optimization.

Prompt: "a standing corgi dog"

RGB | Depth




Prompt: "a yellow rubber duck"

RGB | Depth




Prompt: "a teddy bear"

RGB | Depth




Observations

  • The generated shapes are overall quite accurate, and the geometry of each object is well captured.
  • However, several artifacts appear due to inconsistent view-dependent guidance.
    • For example, the prompt “a standing corgi dog” produces a dog with three ears.
    • The "duck" appears with two beaks, and the teddy bear seems to have multiple faces.
  • These artifacts suggest that the model struggles to maintain 3D consistency.

2.4 Extensions (Choose at least one! More than one is extra credit)¶

2.4.1 View-dependent text embedding (10 points)¶

The hyperparameters are the same as in Q2.3. The following are the results of view-dependent NeRF optimization.

Prompt: "a standing corgi dog"

RGB (Q2.3) | Depth (Q2.3) | RGB (Q2.4.1) | Depth (Q2.4.1)




Prompt: "a yellow rubber duck"

RGB (Q2.3) | Depth (Q2.3) | RGB (Q2.4.1) | Depth (Q2.4.1)




Prompt: "a teddy bear"

RGB (Q2.3) | Depth (Q2.3) | RGB (Q2.4.1) | Depth (Q2.4.1)




Observations

  • Compared to Q2.3, where the model used view-independent text conditioning, the view-dependent text conditioning produces significantly more consistent and coherent shapes across multiple viewpoints. In Q2.3, some objects exhibited artifacts such as duplicated limbs or facial features (e.g., the corgi having three ears or the teddy bear showing multiple faces). These artifacts occur because the text prompt provided the same conditioning for all views, making it hard for the model to maintain 3D consistency when synthesizing images from different camera angles.
  • With view-dependent text conditioning, separate embeddings are used for the front, side, and back views. During rendering, the model interpolates between these embeddings based on the camera azimuth angle, adapting the text representation to each viewpoint. This leads to more consistent geometry across views: front-to-side transitions look smoother, and the model avoids artifacts like the duplicated ears or faces seen in Q2.3. Overall, the objects appear more realistic, with better alignment between the RGB and depth outputs.
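
A minimal sketch of the azimuth-based interpolation described above, assuming emb_front, emb_side, and emb_back are text embeddings of the prompt augmented with ", front view", ", side view", and ", back view"; the exact blending scheme in the assignment code may differ.

    import torch

    def view_dependent_embedding(azimuth_deg, emb_front, emb_side, emb_back):
        # azimuth_deg: camera azimuth in degrees, with 0 = directly in front
        az = azimuth_deg % 360.0
        if az > 180.0:
            az = 360.0 - az          # fold into [0, 180] by left/right symmetry
        if az < 90.0:
            r = az / 90.0            # blend front -> side as the camera swings out
            return (1 - r) * emb_front + r * emb_side
        r = (az - 90.0) / 90.0       # blend side -> back past 90 degrees
        return (1 - r) * emb_side + r * emb_back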