Assignment 4: 3D Gaussian Splatting and Diffusion Guided Optimization
Name: Simson D'Souza, Andrew ID: sjdsouza, Email: sjdsouza@andrew.cmu.edu
1. 3D Gaussian Splatting
1.1 3D Gaussian Rasterization (35 points)
1.1.5 Perform Splatting
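The renders for this part come from compositing depth-sorted 2D Gaussians front to back. Below is a minimal sketch of that compositing step, not the exact assignment code; the function name and tensor layout are assumptions made for illustration.

```python
import torch

def composite_front_to_back(colors, alphas):
    """Illustrative splatting step: alpha-composite N depth-sorted Gaussians.

    colors: (N, P, 3) per-Gaussian RGB contribution at each of P pixels
    alphas: (N, P, 1) per-Gaussian opacity times 2D Gaussian density, in [0, 1]
    """
    # Transmittance T_i = prod_{j<i} (1 - alpha_j): an exclusive cumulative
    # product along the depth-sorted Gaussian axis.
    T = torch.cumprod(1.0 - alphas, dim=0)
    T = torch.cat([torch.ones_like(T[:1]), T[:-1]], dim=0)

    # Each Gaussian contributes T_i * alpha_i * c_i; sum over Gaussians.
    weights = T * alphas                  # (N, P, 1)
    return (weights * colors).sum(dim=0)  # (P, 3) composited image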
1.2 Training 3D Gaussian Representations (15 points)
Learning Rates
- Opacities: 0.001
- Scales: 0.001
- Colours: 0.02
- Means: 0.0002
Number of Iterations: 1000
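These per-parameter learning rates can be wired into a single Adam optimizer using parameter groups; the sketch below assumes the Gaussian parameters live in tensors named `opacities`, `scales`, `colours`, and `means` (names and sizes are illustrative).

```python
import torch

# Illustrative parameter tensors for N Gaussians.
N = 10_000
opacities = torch.rand(N, 1, requires_grad=True)
scales    = torch.rand(N, 3, requires_grad=True)
colours   = torch.rand(N, 3, requires_grad=True)
means     = torch.randn(N, 3, requires_grad=True)

# One Adam optimizer, one learning rate per parameter group.
optimizer = torch.optim.Adam([
    {"params": [opacities], "lr": 1e-3},
    {"params": [scales],    "lr": 1e-3},
    {"params": [colours],   "lr": 2e-2},
    {"params": [means],     "lr": 2e-4},
])
```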
Evaluation Metrics
- PSNR: 28.336
- SSIM: 0.93
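For reference, a minimal PSNR computation for images scaled to [0, 1] (SSIM is typically taken from a library such as scikit-image rather than hand-rolled):

```python
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio in dB for images in [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```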
1.3 Extensions (Choose at least one! More than one is extra credit)
1.3.1 Rendering Using Spherical Harmonics (10 Points)
With Spherical Harmonics
Without Spherical Harmonics (Render from Q1.1.5)
Comparison
| With Spherical Harmonics | Without Spherical Harmonics |
|---|---|
| Frame 5 *(image)* | Frame 5 *(image)* |
| Frame 14 *(image)* | Frame 14 *(image)* |
| Frame 29 *(image)* | Frame 29 *(image)* |
Observations
The following observations apply to all the frame comparisons shown above:
- The renders with Spherical Harmonics exhibit more realistic lighting and shading, as the colors dynamically vary with the viewing direction.
- The renders without Spherical Harmonics appear flat and uniformly lit, lacking directional color variation.
- Spherical Harmonics help capture view-dependent effects such as specular highlights and subtle reflections, improving overall visual fidelity.
- Additionally, fine structural details of the object are more clearly visible in the renders with Spherical Harmonics, enhancing the overall reconstruction quality. A sketch of how SH coefficients are evaluated into view-dependent color follows below.
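The sketch below shows how a DC term plus three degree-1 SH coefficients can be evaluated into a view-dependent color, following the sign and scaling convention used in the reference 3DGS implementation; the tensor shapes and function name are assumptions.

```python
import torch

C0 = 0.28209479177387814   # degree-0 (DC) SH constant
C1 = 0.48860251190291992   # degree-1 SH constant

def sh_to_color(sh, dirs):
    """Evaluate degree-1 spherical harmonics into RGB.

    sh:   (N, 4, 3) coefficients per Gaussian: 1 DC band + 3 degree-1 bands
    dirs: (N, 3)    unit viewing directions (camera centre to Gaussian mean)
    """
    x, y, z = dirs[:, 0:1], dirs[:, 1:2], dirs[:, 2:3]
    rgb = (C0 * sh[:, 0]
           - C1 * y * sh[:, 1]
           + C1 * z * sh[:, 2]
           - C1 * x * sh[:, 3])
    # Offset so zero coefficients map to mid-grey, then clamp to valid range.
    return (rgb + 0.5).clamp(0.0, 1.0)
```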
2. Diffusion-guided Optimization
2.1 SDS Loss + Image Optimization (20 points)
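SDS treats a pretrained diffusion model as a frozen critic: noise is added to the current latents, the model predicts that noise, and the residual is pushed back into the latents as a gradient. Below is a minimal sketch assuming a latent-diffusion guide; `unet` is a hypothetical callable returning the noise prediction, and `alphas_cumprod` stands for the scheduler's cumulative alpha table.

```python
import torch

def sds_loss(latents, text_emb, uncond_emb, unet, alphas_cumprod,
             guidance_scale=100.0):
    """Illustrative SDS step on (B, 4, 64, 64) latents being optimized."""
    B = latents.shape[0]
    t = torch.randint(20, 980, (B,), device=latents.device)   # random timestep
    noise = torch.randn_like(latents)
    a_t = alphas_cumprod[t].view(B, 1, 1, 1)
    noisy = a_t.sqrt() * latents + (1.0 - a_t).sqrt() * noise  # forward diffusion

    with torch.no_grad():  # no gradient flows through the U-Net itself
        eps_uncond = unet(noisy, t, uncond_emb)
        eps_text = unet(noisy, t, text_emb)
    # Classifier-free guidance: extrapolate towards the text-conditioned score.
    eps = eps_uncond + guidance_scale * (eps_text - eps_uncond)

    w = 1.0 - a_t               # common timestep weighting
    grad = w * (eps - noise)    # SDS gradient (U-Net Jacobian omitted)
    # Surrogate loss whose gradient w.r.t. the latents equals `grad`.
    return (grad.detach() * latents).sum()
```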
The following are the images produced with and without guidance for the image optimization task.
Prompt: "a hamburger"
| Without Guidance (2000 iterations) | With Guidance (2000 iterations) |
|---|---|
| *(image)* | *(image)* |
Prompt: "a standing corgi dog"
| Without Guidance (2000 iterations) | With Guidance (2000 iterations) |
|---|---|
| *(image)* | *(image)* |
Prompt: "a unicorn riding a skateboard"
| Without Guidance (2000 iterations) | With Guidance (2000 iterations) |
|---|---|
| *(image)* | *(image)* |
Prompt: "a penguin in sunglasses"
| Without Guidance (2000 iterations) | With Guidance (2000 iterations) |
|---|---|
| *(image)* | *(image)* |
2.2 Texture Map Optimization for Mesh (15 points)
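Here the mesh geometry stays fixed and only a UV texture image is optimized: each step renders the mesh from a random camera with a differentiable renderer and applies the SDS loss from Section 2.1 to the rendering. The sketch below injects the renderer and loss as callables; all names and the texture resolution are illustrative.

```python
import torch

def optimize_texture(mesh, render, sds_loss, sample_camera,
                     n_steps=2000, lr=0.01):
    """Illustrative loop: only the texture map receives gradients."""
    texture = torch.rand(1, 512, 512, 3, requires_grad=True)  # UV texture map
    optimizer = torch.optim.Adam([texture], lr=lr)
    for _ in range(n_steps):
        rgb = render(mesh, texture, sample_camera())  # differentiable rendering
        loss = sds_loss(rgb)                          # SDS as in Section 2.1
        optimizer.zero_grad()
        loss.backward()                               # grads reach only `texture`
        optimizer.step()
    return texture
```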
The following are the results of texture map optimization for a mesh.
Prompt: "a black and white cow"
Iterations: 2000
| Initial Mesh | Final Mesh |
|---|---|
| *(image)* | *(image)* |
Prompt: "an orange bull with spots"
Iterations: 2000
| Initial Mesh | Final Mesh |
| *(image)* | *(image)* |
|
Prompt: "a rainbow-colored cow"
Iterations: 2000
| Initial Mesh | Final Mesh |
| *(image)* | *(image)* |
|
2.3 NeRF Optimization (15 points)
Here, the 3D representation is generalized from a mesh to a NeRF model, where both the geometry and the color are learnable.
Hyperparameters
- Regularization terms, entropy regularization (`lambda_entropy`) and orientation regularization (`lambda_orient`), were tuned to stabilize NeRF optimization and improve geometry quality (see the sketch after this list).
  - `lambda_entropy`: 1e-3
  - `lambda_orient`: 1e-2
- The shading parameter (`latent_iter_ratio`) was tuned to warm up training with normal shading at the beginning and gradually switch to random shading, helping the model learn better geometry and albedo.
  - `latent_iter_ratio`: 0.2
Epochs: 100
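For reference, minimal sketches of the two regularizers, assuming per-ray volume-rendering weights and predicted normals are available; the shapes, names, and exact weighting are assumptions.

```python
import torch

def entropy_reg(weights, eps=1e-6):
    """Binary entropy on blending weights; pushes opacities towards 0 or 1,
    which discourages semi-transparent 'fog' and sharpens geometry."""
    w = weights.clamp(eps, 1.0 - eps)
    return (-(w * w.log() + (1.0 - w) * (1.0 - w).log())).mean()

def orient_reg(weights, normals, ray_dirs):
    """Penalize normals that face away from the camera: `ray_dirs` point from
    the camera into the scene, so a positive dot product means back-facing."""
    n_dot_d = (normals * ray_dirs).sum(dim=-1)
    return (weights.detach() * n_dot_d.clamp(min=0.0) ** 2).mean()

# Illustrative total: loss = sds + 1e-3 * entropy_reg(w) + 1e-2 * orient_reg(w, n, d)
```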
The following are the results of view-independent NeRF optimization.
Prompt: "a standing corgi dog"
| RGB | Depth |
|---|---|
| *(image)* | *(image)* |
Prompt: "a yellow rubber duck"
| RGB | Depth |
|---|---|
| *(image)* | *(image)* |
Prompt: "a teddy bear"
| RGB | Depth |
|---|---|
| *(image)* | *(image)* |
Observations
- The generated shapes are overall quite accurate, and the geometry of each object is well captured.
- However, several artifacts appear because the guidance is not view-aware: the same text conditioning is applied from every camera angle.
- For example, the prompt “a standing corgi dog” produces a dog with three ears.
- The "duck" appears with two beaks, and the teddy bear seems to have multiple faces.
- These artifacts suggest that the model struggles to maintain 3D consistency.
2.4 Extensions (Choose at least one! More than one is extra credit)
2.4.1 View-dependent text embedding (10 points)
The hyperparameters are the same as in Q2.3. The following are the results of view-dependent NeRF optimization.
Prompt: "a standing corgi dog"
| RGB (Q2.3) | Depth (Q2.3) | RGB (Q2.4.1) | Depth (Q2.4.1) |
|---|---|---|---|
| *(image)* | *(image)* | *(image)* | *(image)* |
Prompt: "a yellow rubber duck"
| RGB (Q2.3) | Depth (Q2.3) | RGB (Q2.4.1) | Depth (Q2.4.1) |
|---|---|---|---|
| *(image)* | *(image)* | *(image)* | *(image)* |
Prompt: "a teddy bear"
| RGB (Q2.3) | Depth (Q2.3) | RGB (Q2.4.1) | Depth (Q2.4.1) |
|---|---|---|---|
| *(image)* | *(image)* | *(image)* | *(image)* |
Observations
- Compared to Q2.3, where the model used view-independent text conditioning, the view-dependent text conditioning produces significantly more consistent and coherent shapes across multiple viewpoints. In Q2.3, some objects exhibited artifacts such as duplicated limbs or facial features (e.g., the corgi having three ears or the teddy bear showing multiple faces). These artifacts occur because the text prompt provided the same conditioning for all views, making it hard for the model to maintain 3D consistency when synthesizing images from different camera angles.
- With view-dependent text conditioning, separate embeddings are used for the front, side, and back views. During rendering, the model interpolates between these embeddings based on the camera azimuth angle, adapting the text representation to each viewpoint. This leads to more consistent geometry across views: front-to-side transitions look smoother, and the model avoids artifacts such as the duplicated ears and faces seen in Q2.3. Overall, the objects appear more realistic, with better alignment between the RGB and depth outputs. A sketch of the azimuth-based interpolation follows below.
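A minimal sketch of that interpolation, assuming three precomputed embeddings obtained by appending ", front view", ", side view", and ", back view" to the prompt; the angle convention (0° = front, 180° = back) and all names are assumptions.

```python
import torch

def view_dependent_embedding(azimuth_deg, emb_front, emb_side, emb_back):
    """Blend per-view text embeddings from the camera azimuth angle."""
    az = azimuth_deg % 360.0
    if az > 180.0:                  # mirror: e.g. 270 deg behaves like 90 deg
        az = 360.0 - az
    if az <= 90.0:                  # front -> side
        r = az / 90.0
        return (1.0 - r) * emb_front + r * emb_side
    r = (az - 90.0) / 90.0          # side -> back
    return (1.0 - r) * emb_side + r * emb_back
```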