Name: Ishita Gupta
Andrew ID: ishitag
Rendered GIF:

Learning Rates:
Learning Rates: 0.00016, 0.05, 0.00025, 0.005
Number of Iterations: 1000
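As a minimal sketch of how four per-attribute learning rates like these can be wired into training, below is a per-parameter-group Adam setup. The mapping of the rates to means, opacities, colors, and scales is my assumption (the parameter names were not preserved above), and attributes such as `gaussians.means` are illustrative, not the actual assignment code.

```python
import torch

def build_optimizer(gaussians):
    # One learning rate per Gaussian attribute via Adam parameter groups.
    # The rate-to-attribute mapping below is an assumption; only the values
    # are listed in the report, not which parameter each rate belongs to.
    param_groups = [
        {"params": [gaussians.means],     "lr": 0.00016},
        {"params": [gaussians.opacities], "lr": 0.05},
        {"params": [gaussians.colors],    "lr": 0.00025},
        {"params": [gaussians.scales],    "lr": 0.005},
    ]
    return torch.optim.Adam(param_groups)
```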
Metrics:
PSNR: 29.454
SSIM: 0.940
Training Progress GIF:

Final Renders GIF:

With Spherical Harmonics:

Without Spherical Harmonics (from Q1.1.5):

Side-by-side Comparisons:
| Without SH | With SH | Analysis |
|---|---|---|
| ![]() | ![]() | Rendering with spherical harmonics adds view dependence to the colors, which is visible in the shadow on the chair both in the GIF and in these frames; the SH render looks more realistic and evenly shaded than the render without spherical harmonics (a sketch of SH color evaluation follows this table). |
| ![]() | ![]() | These frames are taken at the same timestep for both renderings, without and with spherical harmonics, and the chair rendered with spherical harmonics shows more realistic shadows. |
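As context for the view dependence described above, here is a minimal sketch of degree-1 spherical-harmonic color evaluation. The tensor layout (`[N, 4, 3]` coefficients) and the `+0.5` shift follow the common 3D Gaussian Splatting convention and are assumptions about this codebase rather than its exact implementation.

```python
import torch

# Real spherical-harmonic basis constants for bands 0 and 1.
C0 = 0.28209479177387814
C1 = 0.4886025119029199

def eval_sh_deg1(sh: torch.Tensor, dirs: torch.Tensor) -> torch.Tensor:
    """View-dependent color from SH coefficients (sketch, degree 1 only).

    sh:   [N, 4, 3] -- DC term plus three band-1 coefficients per RGB channel
          (layout assumed from the usual 3DGS convention).
    dirs: [N, 3]    -- unit viewing directions from the camera to each Gaussian.
    """
    x, y, z = dirs[:, 0:1], dirs[:, 1:2], dirs[:, 2:3]
    rgb = (C0 * sh[:, 0]
           - C1 * y * sh[:, 1]
           + C1 * z * sh[:, 2]
           - C1 * x * sh[:, 3])
    # Shift and clamp so that zero band-1 coefficients reproduce the DC color.
    return torch.clamp(rgb + 0.5, 0.0, 1.0)
```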
Baseline Approach (from the setup in 1.2):
Learning Rates:
0.00016, 0.05, 0.00025, 0.005
Number of Iterations: 1000
Mean PSNR: 18.692
Mean SSIM: 0.684

Trying to Improve Results:
Learning Rates: 0.0005, 0.02, 0.0025, 0.005
Number of Iterations: 3000 (also tried 4000 with similar marginal improvements)
Mean PSNR: 19.0
Mean SSIM: 0.718

| Iterations | Training Progress | Final Renders | Mean PSNR | Mean SSIM |
|---|---|---|---|---|
| 3000 | ![]() | ![]() | 19.0 | 0.718 |
| 4000 | ![]() | ![]() | 19.02 | 0.721 |
Modifications Made: Reduced the learning rates to allow longer training with slower but better convergence, and increased the number of iterations from 1000 to 3000 and 4000.
Analysis: Despite increasing iterations to 3000 and 4000 with adjusted learning rates, the improvement over the baseline remained modest (PSNR: 18.692 -> 19.0, SSIM: 0.684 -> 0.718). The random initialization of Gaussian means combined with the scene's complexity makes optimization challenging, and simply extending training duration was insufficient to achieve substantial quality gains without more sophisticated techniques like adaptive density control or anisotropic Gaussians.
| Prompt | Without Guidance | With Guidance (100 iters) | With Guidance (Final) |
|---|---|---|---|
| "a hamburger" | ![]() | ![]() | ![]() |
| "a standing corgi dog" | ![]() | ![]() | ![]() |
| "a christmas tree" | ![]() | ![]() | ![]() |
| "a sleigh" | ![]() | ![]() | ![]() |
Notes:



Hyperparameters:
lambda_entropy: 1e-3
lambda_orient: 1e-2
latent_iter_ratio: 0.2
(See the regularizer sketch after the table below.)

| Prompt | RGB View | Depth View |
|---|---|---|
| "a standing corgi dog" | ![]() | ![]() |
| "a tree" | ![]() | ![]() |
| "a basket of apples" | ![]() | ![]() |
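The lambda_entropy and lambda_orient weights above presumably scale an opacity-entropy regularizer and a DreamFusion-style orientation penalty, while latent_iter_ratio sets the fraction of iterations supervised in latent space. Below is a minimal sketch of the two regularizers, assuming stable-dreamfusion-style renderer outputs; tensor names and shapes are assumptions, not the actual training code.

```python
import torch

def sds_regularizers(weights, weights_sum, normals, view_dirs,
                     lambda_entropy=1e-3, lambda_orient=1e-2):
    """Sketch of the regularizers the weights above presumably control.

    Assumed inputs (names illustrative): `weights` [B, N] per-sample volume
    rendering weights, `weights_sum` [B] accumulated opacity per ray,
    `normals` [B, N, 3] unit normals, `view_dirs` [B, N, 3] unit ray directions.
    """
    # Opacity entropy: push per-ray opacity towards 0 or 1 (crisper surfaces).
    alpha = weights_sum.clamp(1e-5, 1 - 1e-5)
    loss_entropy = (-alpha * torch.log2(alpha)
                    - (1 - alpha) * torch.log2(1 - alpha)).mean()

    # Orientation penalty (as in DreamFusion): penalize normals that face
    # away from the camera, weighted by the rendering weights.
    loss_orient = (weights.detach()
                   * (normals * view_dirs).sum(-1).clamp(min=0) ** 2).mean()

    return lambda_entropy * loss_entropy + lambda_orient * loss_orient
```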
The output for the prompt "a standing corgi dog" shows good shape and colorization but renders three ears instead of two, while the tree displays a recognizable structure with some blur in the finer details. Geometry does not generalize well for asymmetric objects across all views, so these outputs are not 3D-consistent. However, "a basket of apples" gives decent results in terms of both color and view consistency.
Implementation: In the code, we blend only the "front", "back", and "side" embeddings. Elevation is ignored, and the "top"/"bottom" cases fall back to the default prompt. We weight the embeddings with the sine and cosine of the azimuth and simple max clamps, instead of the Gaussian kernels and softmax normalization used in the original DreamFusion paper. I trained for more epochs and did some hyperparameter tuning to get these best results; this run used 7k iterations and a lower learning rate.
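For reference, below is a minimal sketch of the azimuth-only blending described above, assuming precomputed text embeddings for the three views; the function and tensor names are illustrative rather than the actual assignment code.

```python
import torch

def blend_view_embeddings(embeddings: dict, azimuth_deg: torch.Tensor) -> torch.Tensor:
    """Blend per-view text embeddings based on azimuth only (elevation ignored).

    Assumes `embeddings` holds precomputed text embeddings for the keys
    "front", "side", and "back", each of shape [1, seq_len, dim], and that
    `azimuth_deg` is a batch of camera azimuths in degrees with 0 = front
    and 180 = back.
    """
    az = torch.deg2rad(azimuth_deg).view(-1, 1, 1)    # [B, 1, 1]

    # Simple max-clamped trigonometric weights instead of the Gaussian
    # kernels + softmax normalization used in the original DreamFusion paper.
    w_front = torch.clamp(torch.cos(az), min=0.0)     # peaks at 0 degrees
    w_back  = torch.clamp(-torch.cos(az), min=0.0)    # peaks at 180 degrees
    w_side  = torch.abs(torch.sin(az))                # peaks at +/- 90 degrees

    total = w_front + w_back + w_side + 1e-8          # normalize the weights
    blended = (w_front * embeddings["front"]
               + w_back * embeddings["back"]
               + w_side * embeddings["side"]) / total
    return blended                                     # [B, seq_len, dim]
```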
RGB Video:

Depth Video:

Comparison with Q2.3:
Compared to the results in Q2.3, there is a clear difference in the model's ability to render asymmetric geometry correctly. The corgi previously had three ears but now has two, which is correct. The view-dependent text embeddings help the model understand that different sides should look different, reducing the "Janus problem" where multiple front faces appear. The left/right side views now show the proper asymmetry of the dog's body and features.
RGB Video:

Depth Video:

Analysis:
The basket of apples shows improved 3D consistency compared to Q2.3. With view-dependent conditioning, the basket maintains a more coherent structure across different viewing angles, with the apples and grapes appearing on appropriate sides based on the camera viewpoint. The depth maps reveal better geometric consistency, as the model now receives explicit guidance about which view (front/side/back) it's optimizing, reducing contradictory gradient signals that caused inconsistencies in the non-view-dependent version.
Chosen Representation: Gaussian
Rendering Approach:
Used 3D Gaussian primitives as the representation. Each Gaussian is parameterized by a position (mean), scale, rotation (quaternion), color, and opacity. For rendering I:
A pro is that the renderer is fully differentiable, allowing gradients from the SDS loss to flow back and update the Gaussian parameters.
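To make that gradient path concrete, here is a minimal sketch of one SDS step through a differentiable Gaussian renderer. It assumes a `render_fn` that rasterizes the Gaussians into diffusion latents and a frozen `unet` noise predictor with classifier-free guidance; all names and shapes are illustrative assumptions, not the actual code.

```python
import torch

def sds_step(render_fn, params, unet, alphas_cumprod, text_emb, optimizer,
             guidance_scale=100.0):
    """One score-distillation (SDS) step over Gaussian parameters (sketch).

    Assumptions: `render_fn(params)` is the differentiable rasterizer and
    returns latents of shape [B, 4, 64, 64]; `unet(x, t, emb)` is a frozen
    noise predictor; `alphas_cumprod` is the diffusion schedule tensor.
    """
    latents = render_fn(params)                              # differentiable render
    t = torch.randint(20, 980, (latents.shape[0],), device=latents.device)
    noise = torch.randn_like(latents)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy = a.sqrt() * latents + (1 - a).sqrt() * noise      # forward diffusion

    with torch.no_grad():                                    # no grads through the UNet
        eps_uncond = unet(noisy, t, torch.zeros_like(text_emb))
        eps_text = unet(noisy, t, text_emb)
        eps = eps_uncond + guidance_scale * (eps_text - eps_uncond)

    w = 1 - a                                                # SDS weighting w(t)
    grad = w * (eps - noise)
    # Surrogate loss whose gradient w.r.t. the latents equals `grad`,
    # so backward() pushes the SDS gradient into the Gaussian parameters.
    loss = (grad.detach() * latents).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```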
Loss and Regularization:
Visual Results:
RGB Video:

Depth Video:

Prompt: "a colorful ball"
Comparing with NeRF:
Advantages:
Disadvantages:
The Gaussians converged faster (2000 iterations, approx. 35 min) than NeRF (3000+ iterations, approx. 60-90 min) for prompts of similar complexity, though NeRF produced noticeably better and smoother geometry. I also trained the Gaussians for fewer steps, which may partly explain the poorer smoothness.