This assignment explores advanced methods in 3D vision and generative modeling. The primary focus is on two major components:
3D Gaussian Splatting: We implement and train a differentiable pipeline that renders 3D scenes using Gaussian splats. This includes developing functions for projecting 3D Gaussians to 2D, evaluating them for rendering, filtering/sorting by depth, alpha compositing, and performing the final splatting operation. The system is further extended to enable training of 3D Gaussian parameters directly from images using gradient-based optimization.
Diffusion-guided 3D Optimization: The second part applies Score Distillation Sampling (SDS) loss, enabling optimization of 3D representations guided by powerful text-to-image diffusion models. This allows for text-driven 3D generation and refinement.
Throughout the assignment, we benchmark our rendering and optimization methods both quantitatively (PSNR, SSIM) and qualitatively, comparing with reference outputs and analyzing the effects of architectural extensions such as spherical harmonics lighting. The project also covers practical concerns like reproducibility, efficient PyTorch coding, and training stability.
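As a rough illustration of the compositing step in this pipeline, the sketch below performs front-to-back alpha compositing over depth-sorted, per-pixel splat contributions. The function name and tensor layout are assumptions made for illustration, not the actual assignment interface.

```python
import torch

def composite_depth_sorted(alphas, colours):
    """
    Front-to-back alpha compositing of depth-sorted splat contributions.

    alphas:  (N, H, W)    per-pixel opacity of each splat, nearest first
    colours: (N, H, W, 3) per-pixel colour of each splat
    Returns an (H, W, 3) image.
    """
    # Transmittance T_i = prod_{j < i} (1 - alpha_j): an exclusive
    # cumulative product along the depth-sorted axis.
    transmittance = torch.cumprod(1.0 - alphas, dim=0)
    transmittance = torch.cat(
        [torch.ones_like(transmittance[:1]), transmittance[:-1]], dim=0
    )
    weights = (alphas * transmittance).unsqueeze(-1)   # (N, H, W, 1)
    return (weights * colours).sum(dim=0)              # (H, W, 3)

# Example: composite 100 splats onto a 64x64 canvas.
image = composite_depth_sorted(torch.rand(100, 64, 64) * 0.1,
                               torch.rand(100, 64, 64, 3))
```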

## Training 3D Gaussian Representations
During training, I enabled gradients for the 3D Gaussian parameters (means, opacities, scales, colours) and used the Adam optimizer with a separate learning rate for each parameter group.
The model was trained for 1000 iterations, minimizing the L1 loss between the rendered and ground-truth images. This optimization converged rapidly and produced high-fidelity reconstructions.
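The exact per-parameter learning rates are not reproduced here; the sketch below only illustrates the general training setup, with placeholder learning rates, random initial parameters, and a dummy `render` stub standing in for the differentiable splatting pass.

```python
import torch
import torch.nn.functional as F

# Placeholder Gaussian parameters (random initialisation); shapes and
# learning rates are illustrative assumptions, not the values used for
# the reported results.
N, H, W = 10_000, 128, 128
means     = torch.randn(N, 3, requires_grad=True)
opacities = torch.zeros(N, requires_grad=True)
scales    = torch.zeros(N, 3, requires_grad=True)
colours   = torch.rand(N, 3, requires_grad=True)

optimizer = torch.optim.Adam([
    {"params": [means],     "lr": 1e-4},
    {"params": [opacities], "lr": 5e-2},
    {"params": [scales],    "lr": 5e-3},
    {"params": [colours],   "lr": 2e-2},
])

def render():
    # Stand-in for the differentiable splatting renderer; it only produces
    # a parameter-dependent dummy image so that gradients flow.
    img = colours.mean(dim=0).view(1, 1, 3).expand(H, W, 3)
    return img + 0.0 * (means.sum() + opacities.sum() + scales.sum())

gt_image = torch.rand(H, W, 3)

for it in range(1000):
    optimizer.zero_grad()
    loss = F.l1_loss(render(), gt_image)   # L1 loss against the ground truth
    loss.backward()
    optimizer.step()
```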
Final Results:

Training Progress:
The following GIF visualizes convergence and qualitative improvement over the course of training:

Spherical Harmonics (SH) allow the renderer to model view-dependent effects such as specular highlights and more realistic material appearance by modulating colour with respect to the viewing direction. Without SH, the rendering is limited to a fixed, view-independent colour per Gaussian, resulting in a flatter, less dynamic look.
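As a concrete example, below is a minimal degree-1 SH evaluation in PyTorch, following the convention of the reference 3D Gaussian Splatting implementation; the function name and tensor layout here are assumptions.

```python
import torch

# Real SH basis constants for degrees 0 and 1.
SH_C0 = 0.28209479177387814
SH_C1 = 0.4886025119029199

def sh_to_colour(sh_coeffs, view_dirs):
    """
    Evaluate view-dependent colour from degree-1 SH coefficients.

    sh_coeffs: (N, 4, 3)  per-Gaussian coefficients (DC term + 3 linear terms)
    view_dirs: (N, 3)     unit vectors from the camera to each Gaussian
    Returns (N, 3) RGB colours.
    """
    x, y, z = view_dirs[:, 0:1], view_dirs[:, 1:2], view_dirs[:, 2:3]
    colour = SH_C0 * sh_coeffs[:, 0]
    colour = (colour
              - SH_C1 * y * sh_coeffs[:, 1]
              + SH_C1 * z * sh_coeffs[:, 2]
              - SH_C1 * x * sh_coeffs[:, 3])
    # Offset so that zero coefficients give mid-grey, then clamp to [0, 1].
    return (colour + 0.5).clamp(0.0, 1.0)
```

With only the DC term (degree 0), the colour reduces to a view-independent constant per Gaussian, which is exactly the "without SH" baseline compared below.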
Visual Comparison of Renderings (GIF):
| With Spherical Harmonics | Without Spherical Harmonics |
|---|---|
| ![]() | ![]() |

Individual Frame Comparison:
| With Spherical Harmonics | Without Spherical Harmonics |
|---|---|
| ![]() | ![]() |

Description:
This comparison shows the benefit of integrating spherical harmonics into the 3D Gaussian rendering pipeline: the SH renders capture view-dependent lighting and material effects that the fixed-colour baseline cannot.
For the image optimization task, I trained for 1000 iterations on each prompt, using the Score Distillation Sampling (SDS) loss to optimize the representation from the text prompt. A clear difference was observed between runs with and without guidance; the table below compares the outputs for several prompts:
| Prompt | With Guidance Image | Without Guidance Image |
|---|---|---|
| A Hamburger | ![]() | ![]() |
| a standing corgi dog | ![]() | ![]() |
| Ironman | ![]() | ![]() |
| Spiderman | ![]() | ![]() |

As the table illustrates, guidance is essential for producing meaningful, prompt-faithful outputs with SDS.
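For reference, the sketch below shows one common way to compute the SDS loss with a Stable Diffusion UNet and scheduler from the `diffusers` library; the guidance scale, timestep range, and the omitted noise-level weighting w(t) are simplifications, not the exact settings used for these results.

```python
import torch
import torch.nn.functional as F

def sds_loss(latents, text_emb, uncond_emb, unet, scheduler, guidance_scale=100.0):
    """
    One SDS step: add noise to the current latents at a random timestep,
    predict it with the frozen diffusion UNet under classifier-free
    guidance, and return a loss whose gradient w.r.t. `latents` equals
    (eps_pred - eps).
    """
    t = torch.randint(20, 980, (1,), device=latents.device)
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, t)

    # Classifier-free guidance: one conditional and one unconditional pass.
    latent_in = torch.cat([noisy, noisy], dim=0)
    emb_in = torch.cat([text_emb, uncond_emb], dim=0)
    with torch.no_grad():
        eps = unet(latent_in, t, encoder_hidden_states=emb_in).sample
    eps_cond, eps_uncond = eps.chunk(2)
    eps_pred = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

    # Detach the target so backprop stops at the latents, not the UNet.
    target = (latents - (eps_pred - noise)).detach()
    return 0.5 * F.mse_loss(latents, target, reduction="sum")
```

If "without guidance" is read as disabling classifier-free guidance, it corresponds to setting guidance_scale = 1 in this sketch.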
## Section 2.2: Texture Map Optimization for Mesh
Below is a table presenting two texture optimization outputs: a dotted black and white cow, and a green spotted cow. The first row also provides the initial reference mesh.
| Prompt | Mesh GIF |
|---|---|
| Reference mesh | ![]() |
| a dotted black and white cow | ![]() |
| a green spotted cow | ![]() |
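Although only the optimized textures are shown above, the structure of this experiment is simple: the mesh geometry stays fixed and only the texture is learnable, with gradients flowing from the SDS loss through a differentiable mesh renderer. The sketch below illustrates that structure with stubbed-out renderer and loss functions; the names, resolutions, and learning rate are illustrative assumptions (a per-vertex colour parameterisation would work the same way).

```python
import torch

# Learnable UV texture for the fixed cow mesh; the geometry itself is not
# optimised in this part.
texture = torch.rand(1, 3, 512, 512, requires_grad=True)
optimizer = torch.optim.Adam([texture], lr=1e-2)

def render_textured_mesh(texture):
    # Stand-in for a differentiable mesh renderer (e.g. PyTorch3D) that
    # rasterises the cow mesh and samples the UV texture; here it just
    # returns a texture-dependent dummy image so gradients flow.
    return texture.mean(dim=(2, 3), keepdim=True).expand(1, 3, 256, 256)

def guidance_loss(image):
    # Stand-in for the SDS loss sketched earlier (encode to latents, add
    # noise, query the diffusion UNet, build the detached target).
    return (image - 0.5).pow(2).mean()

for it in range(1000):
    optimizer.zero_grad()
    loss = guidance_loss(render_textured_mesh(texture))
    loss.backward()
    optimizer.step()
```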

| Prompt | Depth Video | RGB Video |
|---|---|---|
| A standing Corgi dog | | |
| A rabbit with a mic | | |
| a rat with dumbbell | | |

| Prompt | Without View-Dependent Embedding | With View-Dependent Embedding |
|---|---|---|
| a rabbit with a mic | | |
| a standing corgi dog | | |

View-dependent text embedding:
View-dependent text embedding augments the conditioning of diffusion or NeRF-based models by injecting information about the camera viewpoint into the text embedding. This allows the generated outputs to better account for changes in perspective, lighting, and object appearance as the camera moves, leading to more consistent and realistic multi-view synthesis. As observed in the comparison above, using view-dependent text embedding typically results in outputs with improved coherence and fidelity to the underlying 3D structure when viewed from different angles.
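A minimal sketch of one common way to implement this, in the spirit of DreamFusion: a coarse view descriptor is appended to the prompt before it is passed through the diffusion model's text encoder. The angle thresholds and function name are assumptions for illustration.

```python
def view_dependent_prompt(base_prompt, azimuth_deg, elevation_deg):
    """
    Append a coarse view descriptor to the prompt based on camera pose.
    Azimuth is assumed to lie in [-180, 180] degrees.
    """
    if elevation_deg > 60.0:
        view = "overhead view"
    elif abs(azimuth_deg) < 45.0:
        view = "front view"
    elif abs(azimuth_deg) > 135.0:
        view = "back view"
    else:
        view = "side view"
    return f"{base_prompt}, {view}"

# Each sampled camera gets its own prompt, which is then encoded by the
# diffusion model's text encoder to obtain a view-dependent embedding.
prompts = [view_dependent_prompt("a standing corgi dog", az, 15.0)
           for az in (0.0, 90.0, 180.0, -90.0)]
# -> ['a standing corgi dog, front view', 'a standing corgi dog, side view',
#     'a standing corgi dog, back view', 'a standing corgi dog, side view']
```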