This assignment explores advanced methods in 3D vision and generative modeling. The primary focus is on two major components:
3D Gaussian Splatting: We implement and train a differentiable pipeline that renders 3D scenes using Gaussian splats. This includes developing functions for projecting 3D Gaussians to 2D, evaluating them for rendering, filtering/sorting by depth, alpha compositing, and performing the final splatting operation. The system is further extended to enable training of 3D Gaussian parameters directly from images using gradient-based optimization.
Diffusion-guided 3D Optimization: The second part applies Score Distillation Sampling (SDS) loss, enabling optimization of 3D representations guided by powerful text-to-image diffusion models. This allows for text-driven 3D generation and refinement.
Throughout the assignment, we benchmark our rendering and optimization methods both quantitatively (PSNR, SSIM) and qualitatively, comparing with reference outputs and analyzing the effects of architectural extensions such as spherical harmonics lighting. The project also covers practical concerns like reproducibility, efficient PyTorch coding, and training stability.
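As a rough illustration of the compositing step in this pipeline, the sketch below performs front-to-back alpha compositing over depth-sorted, per-pixel splat contributions. The function name and tensor layout are assumptions made for illustration, not the actual assignment interface.

```python
import torch

def composite_depth_sorted(alphas, colours):
    """
    Front-to-back alpha compositing of depth-sorted splat contributions.

    alphas:  (N, H, W)    per-pixel opacity of each splat, nearest first
    colours: (N, H, W, 3) per-pixel colour of each splat
    Returns an (H, W, 3) image.
    """
    # Transmittance T_i = prod_{j < i} (1 - alpha_j): an exclusive
    # cumulative product along the depth-sorted axis.
    transmittance = torch.cumprod(1.0 - alphas, dim=0)
    transmittance = torch.cat(
        [torch.ones_like(transmittance[:1]), transmittance[:-1]], dim=0
    )
    weights = (alphas * transmittance).unsqueeze(-1)   # (N, H, W, 1)
    return (weights * colours).sum(dim=0)              # (H, W, 3)

# Example: composite 100 splats onto a 64x64 canvas.
image = composite_depth_sorted(torch.rand(100, 64, 64) * 0.1,
                               torch.rand(100, 64, 64, 3))
```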

## Training 3D Gaussian Representations
During training, I enabled gradients for the 3D Gaussian parameters (means, opacities, scales, colours) and used the Adam optimizer with a separate learning rate for each parameter group.
The model was trained for 1000 iterations, minimizing the L1 loss between the rendered and ground-truth images. This optimization converged rapidly and produced high-fidelity reconstructions.
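The exact per-parameter learning rates are not reproduced here; the sketch below only illustrates the general training setup, with placeholder learning rates, random initial parameters, and a dummy `render` stub standing in for the differentiable splatting pass.

```python
import torch
import torch.nn.functional as F

# Placeholder Gaussian parameters (random initialisation); shapes and
# learning rates are illustrative assumptions, not the values used for
# the reported results.
N, H, W = 10_000, 128, 128
means     = torch.randn(N, 3, requires_grad=True)
opacities = torch.zeros(N, requires_grad=True)
scales    = torch.zeros(N, 3, requires_grad=True)
colours   = torch.rand(N, 3, requires_grad=True)

optimizer = torch.optim.Adam([
    {"params": [means],     "lr": 1e-4},
    {"params": [opacities], "lr": 5e-2},
    {"params": [scales],    "lr": 5e-3},
    {"params": [colours],   "lr": 2e-2},
])

def render():
    # Stand-in for the differentiable splatting renderer; it only produces
    # a parameter-dependent dummy image so that gradients flow.
    img = colours.mean(dim=0).view(1, 1, 3).expand(H, W, 3)
    return img + 0.0 * (means.sum() + opacities.sum() + scales.sum())

gt_image = torch.rand(H, W, 3)

for it in range(1000):
    optimizer.zero_grad()
    loss = F.l1_loss(render(), gt_image)   # L1 loss against the ground truth
    loss.backward()
    optimizer.step()
```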
Final Results:

Training Progress:
The following GIF visualizes convergence and qualitative improvement over the course of training:

Spherical Harmonics (SH) allow the renderer to model view-dependent effects such as specular highlights and more realistic material appearance by modulating colour with respect to the viewing direction. Without SH, the rendering is limited to a fixed, view-independent colour per Gaussian, resulting in a flatter, less dynamic look.
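As a concrete example, below is a minimal degree-1 SH evaluation in PyTorch, following the convention of the reference 3D Gaussian Splatting implementation; the function name and tensor layout here are assumptions.

```python
import torch

# Real SH basis constants for degrees 0 and 1.
SH_C0 = 0.28209479177387814
SH_C1 = 0.4886025119029199

def sh_to_colour(sh_coeffs, view_dirs):
    """
    Evaluate view-dependent colour from degree-1 SH coefficients.

    sh_coeffs: (N, 4, 3)  per-Gaussian coefficients (DC term + 3 linear terms)
    view_dirs: (N, 3)     unit vectors from the camera to each Gaussian
    Returns (N, 3) RGB colours.
    """
    x, y, z = view_dirs[:, 0:1], view_dirs[:, 1:2], view_dirs[:, 2:3]
    colour = SH_C0 * sh_coeffs[:, 0]
    colour = (colour
              - SH_C1 * y * sh_coeffs[:, 1]
              + SH_C1 * z * sh_coeffs[:, 2]
              - SH_C1 * x * sh_coeffs[:, 3])
    # Offset so that zero coefficients give mid-grey, then clamp to [0, 1].
    return (colour + 0.5).clamp(0.0, 1.0)
```

With only the DC term (degree 0), the colour reduces to a view-independent constant per Gaussian, which is exactly the "without SH" baseline compared below.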
Visual Comparison of Renderings (GIF):
| With Spherical Harmonics | Without Spherical Harmonics |
|---|---|
| ![]() | ![]() |

Individual Frame Comparison:
| With Spherical Harmonics | Without Spherical Harmonics |
|---|---|
| ![]() | ![]() |

Description:
This comparison shows the benefit of integrating spherical harmonics into the 3D Gaussian rendering pipeline: the SH renders capture view-dependent lighting and material effects that the fixed-colour baseline cannot.
For the image optimization task, I trained for 1000 iterations on each prompt, using the Score Distillation Sampling (SDS) loss to optimize the representation from the text prompt. A clear difference was observed between runs with and without guidance; the table below compares the outputs for several prompts:
| Prompt | With Guidance Image | Without Guidance Image |
|---|---|---|
| A Hamburger | ![]() | ![]() |
| a standing corgi dog | ![]() | ![]() |
| Ironman | ![]() | ![]() |
| Spiderman | ![]() | ![]() |

As the table illustrates, guidance is essential for producing meaningful, prompt-faithful outputs with SDS.
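For reference, the sketch below shows one common way to compute the SDS loss with a Stable Diffusion UNet and scheduler from the `diffusers` library; the guidance scale, timestep range, and the omitted noise-level weighting w(t) are simplifications, not the exact settings used for these results.

```python
import torch
import torch.nn.functional as F

def sds_loss(latents, text_emb, uncond_emb, unet, scheduler, guidance_scale=100.0):
    """
    One SDS step: add noise to the current latents at a random timestep,
    predict it with the frozen diffusion UNet under classifier-free
    guidance, and return a loss whose gradient w.r.t. `latents` equals
    (eps_pred - eps).
    """
    t = torch.randint(20, 980, (1,), device=latents.device)
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, t)

    # Classifier-free guidance: one conditional and one unconditional pass.
    latent_in = torch.cat([noisy, noisy], dim=0)
    emb_in = torch.cat([text_emb, uncond_emb], dim=0)
    with torch.no_grad():
        eps = unet(latent_in, t, encoder_hidden_states=emb_in).sample
    eps_cond, eps_uncond = eps.chunk(2)
    eps_pred = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

    # Detach the target so backprop stops at the latents, not the UNet.
    target = (latents - (eps_pred - noise)).detach()
    return 0.5 * F.mse_loss(latents, target, reduction="sum")
```

If "without guidance" is read as disabling classifier-free guidance, it corresponds to setting guidance_scale = 1 in this sketch.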
## Section 2.2: Texture Map Optimization for Mesh
Below is a table presenting two texture optimization outputs: a dotted black and white cow, and a green spotted cow. The first row also provides the initial reference mesh.
| Prompt | Mesh GIF |
|---|---|
| Reference mesh | ![]() |
| a dotted black and white cow | ![]() |
| a green spotted cow | ![]() |
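Although only the optimized textures are shown above, the structure of this experiment is simple: the mesh geometry stays fixed and only the texture is learnable, with gradients flowing from the SDS loss through a differentiable mesh renderer. The sketch below illustrates that structure with stubbed-out renderer and loss functions; the names, resolutions, and learning rate are illustrative assumptions (a per-vertex colour parameterisation would work the same way).

```python
import torch

# Learnable UV texture for the fixed cow mesh; the geometry itself is not
# optimised in this part.
texture = torch.rand(1, 3, 512, 512, requires_grad=True)
optimizer = torch.optim.Adam([texture], lr=1e-2)

def render_textured_mesh(texture):
    # Stand-in for a differentiable mesh renderer (e.g. PyTorch3D) that
    # rasterises the cow mesh and samples the UV texture; here it just
    # returns a texture-dependent dummy image so gradients flow.
    return texture.mean(dim=(2, 3), keepdim=True).expand(1, 3, 256, 256)

def guidance_loss(image):
    # Stand-in for the SDS loss sketched earlier (encode to latents, add
    # noise, query the diffusion UNet, build the detached target).
    return (image - 0.5).pow(2).mean()

for it in range(1000):
    optimizer.zero_grad()
    loss = guidance_loss(render_textured_mesh(texture))
    loss.backward()
    optimizer.step()
```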

| Prompt | Depth Video | RGB Video |
|---|---|---|
| A standing Corgi dog | | |
| A rabbit with a mic | | |
| a rat with dumbbell | | |

| Prompt | Without View-Dependent Embedding | With View-Dependent Embedding |
|---|---|---|
| a rabbit with a mic | | |
| a standing corgi dog | | |

View-dependent text embedding:
View-dependent text embedding augments the conditioning of diffusion or NeRF-based models by injecting information about the camera viewpoint into the text embedding. This allows the generated outputs to better account for changes in perspective, lighting, and object appearance as the camera moves, leading to more consistent and realistic multi-view synthesis. As observed in the comparison above, using view-dependent text embedding typically results in outputs with improved coherence and fidelity to the underlying 3D structure when viewed from different angles.
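A minimal sketch of one common way to implement this, in the spirit of DreamFusion: a coarse view descriptor is appended to the prompt before it is passed through the diffusion model's text encoder. The angle thresholds and function name are assumptions for illustration.

```python
def view_dependent_prompt(base_prompt, azimuth_deg, elevation_deg):
    """
    Append a coarse view descriptor to the prompt based on camera pose.
    Azimuth is assumed to lie in [-180, 180] degrees.
    """
    if elevation_deg > 60.0:
        view = "overhead view"
    elif abs(azimuth_deg) < 45.0:
        view = "front view"
    elif abs(azimuth_deg) > 135.0:
        view = "back view"
    else:
        view = "side view"
    return f"{base_prompt}, {view}"

# Each sampled camera gets its own prompt, which is then encoded by the
# diffusion model's text encoder to obtain a view-dependent embedding.
prompts = [view_dependent_prompt("a standing corgi dog", az, 15.0)
           for az in (0.0, 90.0, 180.0, -90.0)]
# -> ['a standing corgi dog, front view', 'a standing corgi dog, side view',
#     'a standing corgi dog, back view', 'a standing corgi dog, side view']
```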