Name: Ishita Gupta

Andrew ID: ishitag

Assignment 4 Submission

1. 3D Gaussian Splatting

1.1.5 Perform Splatting

Rendered GIF:

Q1.1.5 Render


1.2 Training 3D Gaussian Representations

Learning Rates:

Number of Iterations: 1000

Metrics:

Training Progress GIF:

Q1.2 Training Progress

Final Renders GIF:

Q1.2 Final Renders
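The training setup above optimizes each Gaussian attribute with its own learning rate. A minimal runnable sketch of that per-parameter-group structure is below; the learning rates and the trivial "render" are illustrative placeholders (the actual values and renderer are not reproduced here):

```python
import torch

# Sketch of per-parameter-group optimization for Gaussian attributes.
# The learning rates below are placeholders, not the values actually used.
N = 100
colors = torch.rand(N, 3, requires_grad=True)
opacities = torch.zeros(N, 1, requires_grad=True)

optimizer = torch.optim.Adam([
    {"params": [colors], "lr": 2.5e-2},
    {"params": [opacities], "lr": 5e-2},
])

target = torch.tensor([0.2, 0.5, 0.8])  # stand-in for a ground-truth render
for it in range(1000):  # Number of Iterations: 1000
    optimizer.zero_grad()
    # A real iteration would splat the Gaussians into an image; this trivial
    # differentiable "render" (opacity-weighted mean color) keeps it runnable.
    rendered = (torch.sigmoid(opacities) * colors).mean(dim=0)
    loss = torch.nn.functional.mse_loss(rendered, target)
    loss.backward()
    optimizer.step()
```

Separate parameter groups matter because positions, scales, colors, and opacities have very different gradient magnitudes and benefit from different step sizes.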


1.3 Extensions

1.3.1 Rendering Using Spherical Harmonics

With Spherical Harmonics:

Q1.3.1 With SH

Without Spherical Harmonics (from Q1.1.5):

Q1.3.1 Without SH

Side-by-side Comparisons:

| Without SH | With SH | Analysis |
| --- | --- | --- |
| Frame 1 No SH | Frame 1 With SH | Spherical harmonics add view-dependence to the colors, visible in the shadow on the chair both in the GIF and in these frames. The SH render looks more realistic and more evenly shaded than the one without spherical harmonics. |
| Frame 2 No SH | Frame 2 With SH | These frames are taken at the same timestep in both renderings, without and with spherical harmonics; the chair rendered with spherical harmonics shows more realistic shadows. |
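The view-dependent color discussed above comes from evaluating spherical harmonics along the viewing direction. A minimal degree-1 sketch in NumPy (the function name and array shapes are illustrative; real implementations typically also use higher-order bands):

```python
import numpy as np

# Real SH basis constants for degrees 0 and 1 (the usual 3DGS convention).
SH_C0 = 0.28209479177387814
SH_C1 = 0.4886025119029199

def sh_to_rgb(sh_coeffs, view_dirs):
    """sh_coeffs: (N, 4, 3) one DC + three degree-1 coefficients per Gaussian.
    view_dirs: (N, 3) unit vectors from the camera to each Gaussian."""
    x, y, z = view_dirs[:, 0:1], view_dirs[:, 1:2], view_dirs[:, 2:3]
    color = SH_C0 * sh_coeffs[:, 0]              # view-independent base color
    color = color - SH_C1 * y * sh_coeffs[:, 1]  # degree-1, view-dependent
    color = color + SH_C1 * z * sh_coeffs[:, 2]
    color = color - SH_C1 * x * sh_coeffs[:, 3]
    # Shift by 0.5 and clamp to [0, 1], as the reference 3DGS code does.
    return np.clip(color + 0.5, 0.0, 1.0)
```

With only the DC term set, the output is identical from every direction; the degree-1 terms add the direction-dependent shading visible in the comparison above.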

1.3.2 Training On a Harder Scene

Baseline Approach (from the setup in 1.2):

Learning Rates:

Number of Iterations: 1000

Mean PSNR: 18.692

Mean SSIM: 0.684

Q1.3.2 Baseline Training Progress Q1.3.2 Baseline Training Final Render

Trying to Improve Results:

| Iterations | Training Progress | Final Renders | Mean PSNR | Mean SSIM |
| --- | --- | --- | --- | --- |
| 3000 | Q1.3.2 Improved Training Progress 3k iters | Q1.3.2 Improved Final Renders 3k iters | 19.0 | 0.718 |
| 4000 | Q1.3.2 Improved Training Progress 4k iters | Q1.3.2 Improved Final Renders 4k iters | 19.02 | 0.721 |

Modifications Made: Lowered the learning rates for slower but more stable convergence, and increased the number of iterations from 1000 to 3000 and 4000.

Analysis: Despite increasing iterations to 3000 and 4000 with adjusted learning rates, the improvement over the baseline remained modest (PSNR: 18.692 -> 19.02, SSIM: 0.684 -> 0.721). The random initialization of Gaussian means combined with the scene's complexity makes optimization challenging, and simply extending training was insufficient to achieve substantial quality gains without more sophisticated techniques like adaptive density control or anisotropic Gaussians.


2. Diffusion-guided Optimization

2.1 SDS Loss + Image Optimization

| Prompt | Without Guidance | With Guidance (100 iters) | With Guidance (Final) |
| --- | --- | --- | --- |
| "a hamburger" | No Guidance | 100 iters | Final |
| "a standing corgi dog" | No Guidance | 100 iters | Final |
| "a christmas tree" | No Guidance | 100 iters | Final |
| "a sleigh" | No Guidance | 100 iters | Final |

Notes:


2.2 Texture Map Optimization for Mesh

Prompt 1: "chess"

Q2.2 Mesh Texture 1


Prompt 2: "rainbow"

Q2.2 Mesh Texture 2


Prompt 3: "zebra lines"

Q2.2 Mesh Texture 3


2.3 NeRF Optimization

Hyperparameters:

| Prompt | RGB View | Depth View |
| --- | --- | --- |
| "a standing corgi dog" | Corgi RGB | Corgi Depth |
| "a tree" | Tree RGB | Tree Depth |
| "a basket of apples" | Apples RGB | Apples Depth |

The output for the prompt "a standing corgi dog" shows good shape and color but renders three ears instead of two, while the tree has a recognizable structure with some blur in the finer details. Geometry does not generalize well for asymmetric objects across views, so these results are not fully 3D-consistent. "A basket of apples", however, produces decent color and view consistency.


2.4 Extensions

2.4.1 View-dependent Text Embedding

Implementation: In the code, we blend only the "front", "back" and "side" embeddings. Elevation is ignored, so the "top/bottom" cases fall back to the default prompt. We weight the embeddings using sin and cos of the azimuth with simple max clamps, instead of the Gaussian kernels and softmax normalization used in the original DreamFusion paper. I trained for more epochs and did some hyperparameter tuning; the best results below used 7k iterations and a lower learning rate.
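The azimuth-based blending described above can be sketched as follows (a minimal NumPy sketch; the function name and embedding shapes are illustrative, not the actual code):

```python
import numpy as np

def blend_view_embedding(azimuth_deg, front, side, back):
    """Blend front/side/back text embeddings by azimuth using cos/sin weights
    with simple clamps (not DreamFusion's Gaussian kernels + softmax).
    Elevation is ignored. front/side/back: (D,) embedding vectors."""
    az = np.deg2rad(azimuth_deg)
    w_front = np.clip(np.cos(az), 0.0, 1.0)         # peaks at azimuth 0
    w_back = np.clip(-np.cos(az), 0.0, 1.0)         # peaks at azimuth 180
    w_side = np.clip(np.abs(np.sin(az)), 0.0, 1.0)  # peaks at +/- 90
    w = np.array([w_front, w_side, w_back])
    w = w / w.sum()  # normalize so the blended embedding stays in scale
    return w[0] * front + w[1] * side + w[2] * back
```

At azimuth 0 the weights collapse to the pure "front" embedding, at 90 degrees to "side", and at 180 to "back", with smooth interpolation in between.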

Prompt 1: "a standing corgi dog"

RGB Video:

View-dependent Corgi RGB

Depth Video:

View-dependent Corgi Depth

Comparison with Q2.3:

Compared to results in 2.3, there is a clear difference in the model's ability to render asymmetric geometries correctly. The corgi previously had 3 ears but now has 2 which is correct. The view-dependent text embeddings help the model understand that different sides should look different, reducing the "Janus problem" where multiple front faces appear. The left/right side views now show proper asymmetry of the dog's body and features.


Prompt 2: "a basket of apples and grapes"

RGB Video:

View-dependent Apples RGB

Depth Video:

View-dependent Apples Depth

Analysis:

The basket of apples shows improved 3D consistency compared to Q2.3. With view-dependent conditioning, the basket maintains a more coherent structure across different viewing angles, with the apples and grapes appearing on appropriate sides based on the camera viewpoint. The depth maps reveal better geometric consistency, as the model now receives explicit guidance about which view (front/side/back) it's optimizing, reducing contradictory gradient signals that caused inconsistencies in the non-view-dependent version.


2.4.2 Other 3D Representation

Chosen Representation: Gaussian

Rendering Approach:

Used 3D Gaussian primitives as the representation. Each Gaussian is parameterized by a position (mean), scale, rotation (quaternion), color, and opacity. For rendering I:

  1. Project 3D Gaussians to 2D using the camera transformation
  2. Sort Gaussians by depth for correct alpha blending
  3. Rasterize using differentiable splatting - evaluate each 2D Gaussian at pixel locations and alpha-composite colors
  4. Sample random camera poses during training (elevation: -30° to 30°, azimuth: 0° to 360°)
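Steps 1-3 can be sketched with a simplified isotropic splatter (a minimal NumPy sketch with hypothetical signatures; the real renderer uses full 2D covariances from the projected scale/rotation, not a single scalar radius):

```python
import numpy as np

def splat(means3d, colors, opacities, scales2d, K, R, t, H, W):
    """Project 3D Gaussians, sort by depth, and alpha-composite front-to-back."""
    # 1. Project 3D means to 2D pixel coordinates.
    cam = means3d @ R.T + t                  # world -> camera frame
    depth = cam[:, 2]
    proj = cam @ K.T
    uv = proj[:, :2] / proj[:, 2:3]          # perspective divide
    # 2. Sort by depth so alpha blending composites in the correct order.
    order = np.argsort(depth)                # front-to-back
    img = np.zeros((H, W, 3))
    T = np.ones((H, W))                      # per-pixel transmittance
    ys, xs = np.mgrid[0:H, 0:W]
    for i in order:
        if depth[i] <= 0:                    # cull Gaussians behind the camera
            continue
        # 3. Evaluate each 2D Gaussian at every pixel and alpha-composite.
        d2 = (xs - uv[i, 0]) ** 2 + (ys - uv[i, 1]) ** 2
        alpha = opacities[i] * np.exp(-0.5 * d2 / scales2d[i] ** 2)
        img += (T * alpha)[..., None] * colors[i]
        T *= 1.0 - alpha
    return img
```

The per-pixel loop over sorted Gaussians with accumulated transmittance `T` is what makes the compositing order-correct; in the differentiable PyTorch version the same operations run batched so gradients reach every parameter.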

A key advantage is that the renderer is fully differentiable, so gradients flow back from the SDS loss to update the Gaussian parameters.

Loss and Regularization:

Visual Results:

RGB Video:

Gaussian RGB

Depth Video:

Gaussian Depth

Prompt: "a colorful ball"

Comparing with NeRF:

Advantages:

Disadvantages:

Gaussians converged faster (2000 iterations, approx. 35 min) than NeRF (3000+ iterations, approx. 60-90 min) for prompts of similar complexity, though NeRF produced noticeably better and smoother geometry. I also trained the Gaussians for fewer steps, which may partly explain the poorer smoothness.