Name: Ishita Gupta
Andrew ID: ishitag
Rendered GIF:

Learning Rates:
Learning Rates: 0.00016, 0.05, 0.00025, 0.005
Number of Iterations: 1000
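As a minimal sketch of how four per-attribute learning rates like these can be wired into training, below is a per-parameter-group Adam setup. The mapping of the rates to means, opacities, colors, and scales is my assumption (the parameter names were not preserved above), and attributes such as `gaussians.means` are illustrative, not the actual assignment code.

```python
import torch

def build_optimizer(gaussians):
    # One learning rate per Gaussian attribute via Adam parameter groups.
    # The rate-to-attribute mapping below is an assumption; only the values
    # are listed in the report, not which parameter each rate belongs to.
    param_groups = [
        {"params": [gaussians.means],     "lr": 0.00016},
        {"params": [gaussians.opacities], "lr": 0.05},
        {"params": [gaussians.colors],    "lr": 0.00025},
        {"params": [gaussians.scales],    "lr": 0.005},
    ]
    return torch.optim.Adam(param_groups)
```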
Metrics:
PSNR: 29.454
SSIM: 0.940
Training Progress GIF:

Final Renders GIF:

With Spherical Harmonics:

Without Spherical Harmonics (from Q1.1.5):

Side-by-side Comparisons:
| Without SH | With SH | Analysis |
|---|---|---|
| ![]() | ![]() | Rendering with spherical harmonics adds view dependence to the colors, which is visible in the shadow on the chair both in the GIF and in these frames; the SH render looks more realistic and evenly shaded than the render without spherical harmonics (a sketch of SH color evaluation follows this table). |
| ![]() | ![]() | These frames are taken at the same timestep for both renderings, without and with spherical harmonics, and the chair rendered with spherical harmonics shows more realistic shadows. |
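As context for the view dependence described above, here is a minimal sketch of degree-1 spherical-harmonic color evaluation. The tensor layout (`[N, 4, 3]` coefficients) and the `+0.5` shift follow the common 3D Gaussian Splatting convention and are assumptions about this codebase rather than its exact implementation.

```python
import torch

# Real spherical-harmonic basis constants for bands 0 and 1.
C0 = 0.28209479177387814
C1 = 0.4886025119029199

def eval_sh_deg1(sh: torch.Tensor, dirs: torch.Tensor) -> torch.Tensor:
    """View-dependent color from SH coefficients (sketch, degree 1 only).

    sh:   [N, 4, 3] -- DC term plus three band-1 coefficients per RGB channel
          (layout assumed from the usual 3DGS convention).
    dirs: [N, 3]    -- unit viewing directions from the camera to each Gaussian.
    """
    x, y, z = dirs[:, 0:1], dirs[:, 1:2], dirs[:, 2:3]
    rgb = (C0 * sh[:, 0]
           - C1 * y * sh[:, 1]
           + C1 * z * sh[:, 2]
           - C1 * x * sh[:, 3])
    # Shift and clamp so that zero band-1 coefficients reproduce the DC color.
    return torch.clamp(rgb + 0.5, 0.0, 1.0)
```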
Baseline Approach (from the setup in 1.2):
Learning Rates:
0.00016, 0.05, 0.00025, 0.005
Number of Iterations: 1000
Mean PSNR: 18.692
Mean SSIM: 0.684

Trying to Improve Results:
Learning Rates: 0.0005, 0.02, 0.0025, 0.005
Number of Iterations: 3000 (also tried 4000 with similar marginal improvements)
Mean PSNR: 19.0
Mean SSIM: 0.718

| Iterations | Training Progress | Final Renders | Mean PSNR | Mean SSIM |
|---|---|---|---|---|
| 3000 | ![]() | ![]() | 19.0 | 0.718 |
| 4000 | ![]() | ![]() | 19.02 | 0.721 |
Modifications Made: Reduced the learning rates to allow longer training with slower but better convergence, and increased the number of iterations from 1000 to 3000 and 4000.
Analysis: Despite increasing iterations to 3000 and 4000 with adjusted learning rates, the improvement over the baseline remained modest (PSNR: 18.692 -> 19.0, SSIM: 0.684 -> 0.718). The random initialization of Gaussian means combined with the scene's complexity makes optimization challenging, and simply extending training duration was insufficient to achieve substantial quality gains without more sophisticated techniques like adaptive density control or anisotropic Gaussians.
| Prompt | Without Guidance | With Guidance (100 iters) | With Guidance (Final) |
|---|---|---|---|
| "a hamburger" | ![]() | ![]() | ![]() |
| "a standing corgi dog" | ![]() | ![]() | ![]() |
| "a christmas tree" | ![]() | ![]() | ![]() |
| "a sleigh" | ![]() | ![]() | ![]() |
Notes:



Hyperparameters:
lambda_entropy: 1e-3
lambda_orient: 1e-2
latent_iter_ratio: 0.2
(See the regularizer sketch after the table below.)

| Prompt | RGB View | Depth View |
|---|---|---|
| "a standing corgi dog" | ![]() | ![]() |
| "a tree" | ![]() | ![]() |
| "a basket of apples" | ![]() | ![]() |
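The lambda_entropy and lambda_orient weights above presumably scale an opacity-entropy regularizer and a DreamFusion-style orientation penalty, while latent_iter_ratio sets the fraction of iterations supervised in latent space. Below is a minimal sketch of the two regularizers, assuming stable-dreamfusion-style renderer outputs; tensor names and shapes are assumptions, not the actual training code.

```python
import torch

def sds_regularizers(weights, weights_sum, normals, view_dirs,
                     lambda_entropy=1e-3, lambda_orient=1e-2):
    """Sketch of the regularizers the weights above presumably control.

    Assumed inputs (names illustrative): `weights` [B, N] per-sample volume
    rendering weights, `weights_sum` [B] accumulated opacity per ray,
    `normals` [B, N, 3] unit normals, `view_dirs` [B, N, 3] unit ray directions.
    """
    # Opacity entropy: push per-ray opacity towards 0 or 1 (crisper surfaces).
    alpha = weights_sum.clamp(1e-5, 1 - 1e-5)
    loss_entropy = (-alpha * torch.log2(alpha)
                    - (1 - alpha) * torch.log2(1 - alpha)).mean()

    # Orientation penalty (as in DreamFusion): penalize normals that face
    # away from the camera, weighted by the rendering weights.
    loss_orient = (weights.detach()
                   * (normals * view_dirs).sum(-1).clamp(min=0) ** 2).mean()

    return lambda_entropy * loss_entropy + lambda_orient * loss_orient
```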
The output for the prompt "a standing corgi dog" shows good shape and colorization but renders three ears instead of two, while the tree displays a recognizable structure with some blur in the finer details. Geometry does not generalize well for asymmetric objects across all views, so these outputs are not 3D-consistent. However, "a basket of apples" gives decent results in terms of both color and view consistency.
Implementation: In the code, we blend only the "front", "back", and "side" embeddings. Elevation is ignored, and the "top"/"bottom" cases fall back to the default prompt. We weight the embeddings with the sine and cosine of the azimuth and simple max clamps, instead of the Gaussian kernels and softmax normalization used in the original DreamFusion paper. I trained for more epochs and did some hyperparameter tuning to get these best results; this run used 7k iterations and a lower learning rate.
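For reference, below is a minimal sketch of the azimuth-only blending described above, assuming precomputed text embeddings for the three views; the function and tensor names are illustrative rather than the actual assignment code.

```python
import torch

def blend_view_embeddings(embeddings: dict, azimuth_deg: torch.Tensor) -> torch.Tensor:
    """Blend per-view text embeddings based on azimuth only (elevation ignored).

    Assumes `embeddings` holds precomputed text embeddings for the keys
    "front", "side", and "back", each of shape [1, seq_len, dim], and that
    `azimuth_deg` is a batch of camera azimuths in degrees with 0 = front
    and 180 = back.
    """
    az = torch.deg2rad(azimuth_deg).view(-1, 1, 1)    # [B, 1, 1]

    # Simple max-clamped trigonometric weights instead of the Gaussian
    # kernels + softmax normalization used in the original DreamFusion paper.
    w_front = torch.clamp(torch.cos(az), min=0.0)     # peaks at 0 degrees
    w_back  = torch.clamp(-torch.cos(az), min=0.0)    # peaks at 180 degrees
    w_side  = torch.abs(torch.sin(az))                # peaks at +/- 90 degrees

    total = w_front + w_back + w_side + 1e-8          # normalize the weights
    blended = (w_front * embeddings["front"]
               + w_back * embeddings["back"]
               + w_side * embeddings["side"]) / total
    return blended                                     # [B, seq_len, dim]
```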
RGB Video:

Depth Video:

Comparison with Q2.3:
Compared to the results in Q2.3, there is a clear difference in the model's ability to render asymmetric geometry correctly. The corgi previously had three ears but now has two, which is correct. The view-dependent text embeddings help the model understand that different sides should look different, reducing the "Janus problem" where multiple front faces appear. The left/right side views now show the proper asymmetry of the dog's body and features.
RGB Video:

Depth Video:

Analysis:
The basket of apples shows improved 3D consistency compared to Q2.3. With view-dependent conditioning, the basket maintains a more coherent structure across different viewing angles, with the apples and grapes appearing on appropriate sides based on the camera viewpoint. The depth maps reveal better geometric consistency, as the model now receives explicit guidance about which view (front/side/back) it's optimizing, reducing contradictory gradient signals that caused inconsistencies in the non-view-dependent version.
Chosen Representation: Gaussian
Rendering Approach:
Used 3D Gaussian primitives as the representation. Each Gaussian is parameterized by a position (mean), scale, rotation (quaternion), color, and opacity. For rendering I:
A pro is that the renderer is fully differentiable, allowing gradients from the SDS loss to flow back and update the Gaussian parameters.
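To make that gradient path concrete, here is a minimal sketch of one SDS step through a differentiable Gaussian renderer. It assumes a `render_fn` that rasterizes the Gaussians into diffusion latents and a frozen `unet` noise predictor with classifier-free guidance; all names and shapes are illustrative assumptions, not the actual code.

```python
import torch

def sds_step(render_fn, params, unet, alphas_cumprod, text_emb, optimizer,
             guidance_scale=100.0):
    """One score-distillation (SDS) step over Gaussian parameters (sketch).

    Assumptions: `render_fn(params)` is the differentiable rasterizer and
    returns latents of shape [B, 4, 64, 64]; `unet(x, t, emb)` is a frozen
    noise predictor; `alphas_cumprod` is the diffusion schedule tensor.
    """
    latents = render_fn(params)                              # differentiable render
    t = torch.randint(20, 980, (latents.shape[0],), device=latents.device)
    noise = torch.randn_like(latents)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy = a.sqrt() * latents + (1 - a).sqrt() * noise      # forward diffusion

    with torch.no_grad():                                    # no grads through the UNet
        eps_uncond = unet(noisy, t, torch.zeros_like(text_emb))
        eps_text = unet(noisy, t, text_emb)
        eps = eps_uncond + guidance_scale * (eps_text - eps_uncond)

    w = 1 - a                                                # SDS weighting w(t)
    grad = w * (eps - noise)
    # Surrogate loss whose gradient w.r.t. the latents equals `grad`,
    # so backward() pushes the SDS gradient into the Gaussian parameters.
    loss = (grad.detach() * latents).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```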
Loss and Regularization:
Visual Results:
RGB Video:

Depth Video:

Prompt: "a colorful ball"
Comparing with NeRF:
Advantages:
Disadvantages:
The Gaussians converged faster (2000 iterations, approx. 35 min) than NeRF (3000+ iterations, approx. 60-90 min) for prompts of similar complexity, though NeRF produced noticeably better and smoother geometry. I also trained the Gaussians for fewer steps, which may partly explain the poorer smoothness.