Part 1: 3D Gaussian Splatting
1.1.5 3D Gaussian Rasterization (35 points)
Implemented a 3D Gaussian rasterization pipeline in PyTorch, including projection of 3D Gaussians to 2D, evaluation of 2D Gaussians, filtering and sorting, alpha and transmittance computation, and splatting for color blending.
Implementation Overview
The rasterization pipeline consists of several key steps:
- Project 3D Gaussians to 2D: Compute 2D mean and covariance from 3D Gaussian parameters using camera projection
- Evaluate 2D Gaussians: Compute the power (exponent) of 2D Gaussian at each pixel location
- Filter and Sort: Filter Gaussians behind the camera and sort by depth
- Compute Alphas and Transmittance: Calculate alpha values and transmittance for proper alpha blending
- Perform Splatting: Blend colors using alpha and transmittance values (see the sketch after this list)
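To make the last two steps concrete, here is a minimal sketch of the alpha/transmittance computation and splatting, assuming per-pixel Gaussian powers of shape (N, H, W) for N depth-sorted Gaussians, per-Gaussian opacities of shape (N, 1, 1), and per-Gaussian colours of shape (N, 3). Variable names are illustrative, not the assignment's exact API.

# Sketch: alpha compositing of depth-sorted 2D Gaussians (illustrative only)
import torch

def composite_colours(power, opacity, colour, eps=1e-4):
    # Alpha of Gaussian i at each pixel: opacity * exp(power), clamped for stability
    alpha = torch.clamp(opacity * torch.exp(power), max=0.99)              # (N, H, W)
    # Transmittance T_i = prod_{j < i} (1 - alpha_j), with T_0 = 1 (exclusive cumprod)
    transmittance = torch.cumprod(1.0 - alpha + eps, dim=0)
    transmittance = torch.cat(
        [torch.ones_like(transmittance[:1]), transmittance[:-1]], dim=0)   # (N, H, W)
    # Splatting: C = sum_i T_i * alpha_i * c_i
    weights = (alpha * transmittance).unsqueeze(-1)                        # (N, H, W, 1)
    return (weights * colour[:, None, None, :]).sum(dim=0)                 # (H, W, 3)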
Results
Rendering of pre-trained 3D Gaussians (view-independent, DC component only)
1.2.2 Training 3D Gaussian Representations (15 points)
Implemented training code to optimize 3D Gaussians from multi-view images and a point cloud initialization. The training includes making parameters trainable, setting up optimizers with different learning rates, and implementing the forward pass with loss computation.
Training Details
Learning Rates Used
- Means (positions): 0.0001 (smaller learning rate for stable geometry)
- Opacities: 0.01 (larger learning rate for faster convergence)
- Colors: 0.01 (larger learning rate for appearance)
- Scales (for isotropic Gaussians): 0.005 (moderate learning rate)
Number of Iterations: 2000 iterations
Loss Function: L1 loss between predicted and ground truth images
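A minimal sketch of one way to realize this setup with a single Adam optimizer and per-parameter-group learning rates; the tensor names are illustrative, and the actual code may use separate optimizers or a different optimizer class.

# Sketch: per-parameter learning rates and L1 loss (illustrative names)
import torch
import torch.nn.functional as F

means     = torch.nn.Parameter(torch.zeros(1000, 3))   # 3D positions
opacities = torch.nn.Parameter(torch.zeros(1000, 1))   # opacities (pre-activation)
colours   = torch.nn.Parameter(torch.rand(1000, 3))    # DC colours
scales    = torch.nn.Parameter(torch.zeros(1000, 1))   # isotropic scales (pre-activation)

optimizer = torch.optim.Adam([
    {"params": [means],     "lr": 1e-4},   # small LR keeps geometry stable
    {"params": [opacities], "lr": 1e-2},
    {"params": [colours],   "lr": 1e-2},
    {"params": [scales],    "lr": 5e-3},
])

# Inside the training loop (rasterize stands in for the renderer from 1.1.5):
# pred = rasterize(means, opacities, colours, scales, camera)
# loss = F.l1_loss(pred, gt_image)
# loss.backward(); optimizer.step(); optimizer.zero_grad()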
Training Progress
Training Progress: Top row shows predicted renderings, bottom row shows ground truth
Final Results
Final Rendered Views After Training
Performance Metrics
PSNR: 29.334
SSIM: 0.939
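For reference, a minimal sketch of how the PSNR value above can be computed for predictions and ground truth in [0, 1]; the SSIM number would typically come from a library implementation (e.g. scikit-image's structural_similarity) rather than hand-rolled code.

# Sketch: PSNR in dB for images with values in [0, 1]
import torch

def psnr(pred: torch.Tensor, gt: torch.Tensor) -> float:
    mse = torch.mean((pred - gt) ** 2)
    return (10.0 * torch.log10(1.0 / mse)).item()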
1.3.1 Rendering Using Spherical Harmonics (10 points)
Extended the rasterizer to support view-dependent rendering using spherical harmonics. This enables rendering of pre-trained 3D Gaussians with full spherical harmonic coefficients, capturing view-dependent effects like reflections and specular highlights.
Implementation
The key addition is the colours_from_spherical_harmonics function, which evaluates spherical harmonic basis functions given viewing directions and combines them with the learned SH coefficients to produce view-dependent colors.
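A minimal sketch of such an evaluation, truncated to SH degree 1 for brevity and assuming the real-SH constants and sign convention used by the reference 3D Gaussian Splatting code; the actual implementation evaluates all bands of the pre-trained coefficients.

# Sketch: view-dependent colour from degree-1 spherical harmonics (illustrative)
import torch

C0 = 0.28209479177387814   # band-0 (DC) constant
C1 = 0.4886025119029199    # band-1 constant

def colours_from_sh_deg1(sh, directions):
    # sh:         (N, 4, 3) coefficients per Gaussian (DC + three degree-1 terms)
    # directions: (N, 3)    unit viewing directions from camera centre to Gaussian
    x, y, z = directions[:, 0:1], directions[:, 1:2], directions[:, 2:3]
    colour = (C0 * sh[:, 0]
              - C1 * y * sh[:, 1]
              + C1 * z * sh[:, 2]
              - C1 * x * sh[:, 3])
    # Shift by 0.5 and clamp to get non-negative RGB, as in the reference code
    return torch.clamp(colour + 0.5, min=0.0)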
Comparison: With vs. Without Spherical Harmonics
Without Spherical Harmonics (Q1.1.5)
View-independent rendering (DC component only)
With Spherical Harmonics (Q1.3.1)
View-dependent rendering (full SH coefficients)
Observations
- View Independence vs. View Dependence: Without SH, the rendering appears flat and lacks view-dependent effects. With SH, the rendering shows realistic view-dependent effects like reflections and specular highlights.
- Visual Quality: The SH rendering captures more realistic material properties, especially for shiny or reflective surfaces.
- Color Variation: With SH, colors change smoothly as the viewing angle changes, creating a more photorealistic appearance.
Part 2: Diffusion-guided Optimization
2.1 SDS Loss + Image Optimization (20 points)
Implemented the Score Distillation Sampling (SDS) loss with and without classifier-free guidance. The SDS loss uses a pre-trained Stable Diffusion model to guide image optimization toward matching a text prompt.
Implementation Details
# SDS Loss Implementation (from SDS.py)
def sds_loss(self, latents, text_embeddings, text_embeddings_uncond=None,
guidance_scale=100, grad_scale=1):
# Sample random timestep
t = torch.randint(self.min_step, self.max_step + 1,
(latents.shape[0],), dtype=torch.long, device=self.device)
# Add noise to latents
with torch.no_grad():
noise = torch.randn_like(latents)
latents_noisy = self.scheduler.add_noise(latents, noise, t)
# Predict noise
noise_pred = self.unet(latents_noisy, t,
encoder_hidden_states=text_embeddings).sample
# Classifier-free guidance
if text_embeddings_uncond is not None and guidance_scale != 1:
noise_pred_uncond = self.unet(latents_noisy, t,
encoder_hidden_states=text_embeddings_uncond).sample
noise_pred = noise_pred_uncond + guidance_scale * (noise_pred - noise_pred_uncond)
# Compute SDS gradient
w = 1 - self.alphas[t]
grad = w[:, None, None, None] * (noise_pred.detach() - noise) * grad_scale
# Target trick: MSE between current and target latents
loss = 0.5 * F.mse_loss(latents, (latents - grad).detach(),
reduction="sum") / latents.shape[0]
return loss
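For context, one possible image-optimization loop driven by this loss is sketched below; it assumes encode_imgs accepts a (1, 3, 512, 512) image in [0, 1] and optimizes the RGB image directly, whereas the submitted code may instead optimize the latents.

# Sketch: optimizing an image against a text prompt with SDS (illustrative)
import torch

image = torch.nn.Parameter(torch.rand(1, 3, 512, 512, device=sds.device))
optimizer = torch.optim.AdamW([image], lr=0.1)

for step in range(2000):
    latents = sds.encode_imgs(image)                     # RGB -> Stable Diffusion latents
    loss = sds.sds_loss(latents,
                        text_embeddings=embeddings['cond'],
                        text_embeddings_uncond=embeddings['uncond'],
                        guidance_scale=100)
    optimizer.zero_grad()
    loss.backward()                                      # SDS gradient reaches the image
    optimizer.step()
    with torch.no_grad():
        image.clamp_(0.0, 1.0)                           # keep pixels in a valid range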
Results for Four Different Prompts
1. "a hamburger"
Without Guidance
Iterations: 2000
With Guidance
Iterations: 2000
2. "a standing corgi dog"
With Guidance - Iterations: 2000
3. "a cat lying on a rooftop"
With Guidance - Iterations: 2000
4. "a pigeon on an wire"
With Guidance - Iterations: 2000
Observations
- Without Guidance (hamburger only): The image is abstract and lacks detail. The optimization converges faster but produces lower-quality results.
- With Guidance: Images show significantly better fidelity and detail. Classifier-free guidance (guidance_scale=100) amplifies the conditional signal, resulting in more accurate prompt matching.
- Training: All results were trained for 2000 iterations with learning rate 0.1 using AdamW optimizer.
2.2 Texture Map Optimization for Mesh (15 points)
Optimized the texture map of a fixed-geometry cow mesh using SDS loss. The texture is represented by a ColorField neural network that maps vertex coordinates to RGB colors.
Implementation Details
# Mesh Texture Optimization (from Q22_mesh_optimization.py)
# Initialize the learnable texture field (vertex coordinates -> RGB)
color_field = ColorField().to(device)

# Training loop: randomly sample a camera, rebuild the textured mesh, render, apply SDS
for iteration in range(total_iter):
    # Random camera pose
    dist = torch.rand(1).item() * (dist_max - dist_min) + dist_min
    elev = torch.rand(1).item() * (elev_max - elev_min) + elev_min
    azim = torch.rand(1).item() * 360
    R, T = look_at_view_transform(dist=dist, elev=elev, azim=azim)
    cameras = FoVPerspectiveCameras(device=device, R=R, T=T)

    # Re-evaluate the texture field each iteration so gradients reach color_field
    mesh = pytorch3d.structures.Meshes(
        verts=vertices,
        faces=faces,
        textures=TexturesVertex(verts_features=color_field(vertices)),
    )

    # Render the mesh from the sampled viewpoint
    rend = renderer(mesh, cameras=cameras, lights=lights)
    rend = rend[0, ..., :3]  # extract RGB

    # Compute SDS loss on the encoded rendering
    latents = sds.encode_imgs(rend)
    loss = sds.sds_loss(latents, text_embeddings=embeddings['cond'],
                        text_embeddings_uncond=embeddings['uncond'],
                        guidance_scale=100, grad_scale=1)
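The ColorField network referenced above maps vertex coordinates to RGB. Purely as an illustration (an assumption about its structure, not the actual implementation), such a coordinate-to-colour MLP could look like this:

# Sketch: hypothetical coordinate-to-colour MLP, (x, y, z) -> (r, g, b) in [0, 1]
import torch
import torch.nn as nn

class ColorFieldSketch(nn.Module):
    def __init__(self, hidden_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 3), nn.Sigmoid(),   # RGB in [0, 1]
        )

    def forward(self, vertices):
        # vertices: (..., 3) vertex coordinates; returns per-vertex RGB of the same shape
        return self.mlp(vertices)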
Results for Two Different Text Prompts
1. "a cow with rainbow stripes"
360° rotation of textured mesh
2. "a dotted black and white cow"
360° rotation of textured mesh
Observations
- Texture Quality: The SDS loss successfully applies diverse textures (stripes, dots) to the mesh geometry.
- View Consistency: The texture appears consistent across different viewing angles, demonstrating successful optimization.
- Training: Trained for 2000 iterations with random camera sampling for diverse viewpoints.
2.3 NeRF Optimization (15 points)
Optimized a NeRF model to generate 3D objects from text prompts using SDS loss. Both geometry and appearance are learnable, unlike the fixed-geometry mesh in Q2.2.
Implementation Details
# NeRF Optimization (from Q23_nerf_optimization.py)
# Render NeRF from random camera pose
pred_rgb, pred_depth = renderer.render(rays_o, rays_d, staged=False,
perturb=True, bg_color=bg_color)
# Encode to latent space
pred_rgb_512 = F.interpolate(pred_rgb.permute(2,0,1).unsqueeze(0),
size=(512, 512), mode='bilinear', align_corners=False)
latents = sds.encode_imgs(pred_rgb_512)
# Compute SDS loss
loss = sds.sds_loss(latents, text_embeddings=embeddings['default'],
text_embeddings_uncond=embeddings['uncond'],
guidance_scale=100, grad_scale=1)
# Regularization losses
loss_entropy = entropy_loss(weights, pred_depth)
loss_orient = orientation_loss(normals, rays_d)
loss_total = loss + lambda_entropy * loss_entropy + lambda_orient * loss_orient
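The two regularizers are standard DreamFusion-style terms. A minimal sketch of what they compute is below, assuming per-sample compositing weights, rendered normals, and ray directions as inputs; the exact tensor layouts and log base in the assignment code may differ.

# Sketch: DreamFusion-style NeRF regularizers (illustrative)
import torch

def entropy_regularizer(weights):
    # Binary entropy of compositing weights: pushes densities toward 0 or 1,
    # discouraging semi-transparent "cloudy" geometry.
    w = weights.clamp(1e-5, 1 - 1e-5)
    return (-w * torch.log2(w) - (1 - w) * torch.log2(1 - w)).mean()

def orientation_regularizer(weights, normals, dirs):
    # Penalize normals that face away from the camera; `dirs` point from the
    # camera into the scene, so a positive dot product means "facing away".
    facing_away = (normals * dirs).sum(dim=-1).clamp(min=0.0)
    return (weights.detach() * facing_away ** 2).mean()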
Hyperparameters Tuned
- lambda_entropy: 1e-3 (encourages sparse density, prevents cloudy geometry)
- lambda_orient: 1e-2 (encourages normals to face camera, improves geometry)
- latent_iter_ratio: 0.2 (first 20% of training uses normal shading for geometry warmup)
Results for Three Example Prompts
1. "a standing corgi dog"
RGB Rendering
Depth Map
2. "a drumming gorilla"
RGB Rendering
Depth Map
3. "a rabbit eating a carrot"
RGB Rendering
Depth Map
Observations
- Geometry Quality: The NeRF successfully learns overall 3D geometry matching the text prompts, with recognizable shapes and structures. However, the geometry is not perfectly consistent across different views (e.g., the corgi has multiple front faces and three ears).
- Appearance: Colors and textures are reasonably matched to the prompts, though not photorealistic.
- Depth Maps: Show clear geometric structure and depth variation, confirming successful 3D learning.
- Training: Trained for 10000 iterations with progressive shading (normal → textureless/lambertian).
2.4.1 View-Dependent Text Embedding (10 points)
Extended Q2.3 with view-dependent text embeddings to improve 3D consistency. Different text embeddings are used based on the camera viewing angle (front, side, back, overhead, bottom).
Implementation Details
# View-Dependent Text Embedding (from Q23_nerf_optimization.py)
if args.view_dep_text:
azimuth_val = azimuth.item()
abs_polar = 90 + polar.item() # absolute polar angle in degrees
angle_overhead = 30 # degrees
if abs_polar <= angle_overhead:
# Overhead view
text_cond = embeddings['overhead']
elif abs_polar >= (180 - angle_overhead):
# Bottom view
text_cond = embeddings['bottom']
else:
# Normal view: interpolate front/side/back based on azimuth
if azimuth_val >= -90 and azimuth_val < 90:
if azimuth_val >= 0:
r = 1 - azimuth_val / 90
else:
r = 1 + azimuth_val / 90
text_cond = r * embeddings['front'] + (1 - r) * embeddings['side']
else:
if azimuth_val >= 0:
r = 1 - (azimuth_val - 90) / 90
else:
r = 1 + (azimuth_val + 90) / 90
text_cond = r * embeddings['side'] + (1 - r) * embeddings['back']
else:
text_cond = embeddings['default']
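The per-view embeddings referenced above ('front', 'side', 'back', 'overhead', 'bottom') follow DreamFusion's view-dependent prompt augmentation. A hedged sketch of how they could be precomputed is shown below; the prompt-encoding helper is passed in as an argument because its exact name in SDS.py is not shown here.

# Sketch: building view-dependent text embeddings (illustrative helper)
import torch
from typing import Callable, Dict

def build_view_embeddings(prompt: str,
                          encode: Callable[[str], torch.Tensor]) -> Dict[str, torch.Tensor]:
    suffixes = {
        'front': ", front view",
        'side': ", side view",
        'back': ", back view",
        'overhead': ", overhead view",
        'bottom': ", bottom view",
    }
    embeddings = {name: encode(prompt + suffix) for name, suffix in suffixes.items()}
    embeddings['default'] = encode(prompt)
    embeddings['uncond'] = encode("")   # empty prompt for classifier-free guidance
    return embeddings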
Results for Two Example Prompts
1. "a standing corgi dog"
Q2.3 (Baseline)
RGB
Depth
Q2.4.1 (View-Dependent)
RGB
Depth
2. "a rabbit eating a carrot"
Q2.3 (Baseline)
RGB
Depth
Q2.4.1 (View-Dependent)
RGB
Depth
Qualitative Analysis: Effects of View-Dependent Text Conditioning
Comparison with Q2.3 (Baseline):
- 3D Consistency: View-dependent text embeddings help reduce multi-view inconsistencies. The baseline (Q2.3) sometimes shows artifacts like "two mouths" or "three ears" because each view is optimized independently. View-dependent conditioning provides view-specific guidance that encourages consistent geometry.
- Geometry Quality: The view-dependent approach maintains similar or slightly improved geometric quality. The depth maps show consistent structure across views.
- Appearance: RGB renderings show similar appearance quality, with view-dependent embeddings providing more context-aware optimization.
- Limitations: While view-dependent embeddings help, some 3D inconsistencies may still persist due to the fundamental challenge of optimizing 3D structure from 2D diffusion guidance. The effectiveness depends on how well the diffusion model understands view-specific descriptions.
Key Insight: View-dependent text conditioning follows DreamFusion's view-dependent prompting (Section 3.2.3), where different text embeddings are used for different viewing angles. This provides more targeted guidance during optimization, reducing the "Janus problem" (multiple front faces) and improving overall 3D consistency.
2.4.3 Variation of Implementation of SDS Loss (10 points)
Implemented SDS loss in pixel space instead of latent space, computing gradients directly in RGB space for potentially better perceptual alignment.
Implementation: Pixel-Space SDS Loss
# Pixel-Space SDS Loss (from SDS.py)
def sds_loss_pixel(self, latents, text_embeddings,
text_embeddings_uncond=None, guidance_scale=100, grad_scale=1):
# Sample timestep and add noise (same as latent-space)
t = torch.randint(self.min_step, self.max_step + 1,
(latents.shape[0],), dtype=torch.long, device=self.device)
with torch.no_grad():
noise = torch.randn_like(latents)
latents_noisy = self.scheduler.add_noise(latents, noise, t)
noise_pred = self.unet(latents_noisy, t,
encoder_hidden_states=text_embeddings).sample
# Classifier-free guidance
if text_embeddings_uncond is not None and guidance_scale != 1:
noise_pred_uncond = self.unet(latents_noisy, t,
encoder_hidden_states=text_embeddings_uncond).sample
noise_pred = noise_pred_uncond + guidance_scale * (noise_pred - noise_pred_uncond)
# Compute SDS gradient in latent space
alpha_t = self.alphas[t]
w = 1 - alpha_t
grad_latent = w[:, None, None, None] * (noise_pred.detach() - noise) * grad_scale
# Decode current latents to pixel space
latents_scaled = 1 / self.vae.config.scaling_factor * latents
imgs_orig = self.vae.decode(latents_scaled.type(self.precision_t)).sample
imgs_orig = (imgs_orig / 2 + 0.5).clamp(0, 1)
# Decode target latents (latents - grad) to pixel space
with torch.no_grad():
latents_target = latents - grad_latent
latents_target_scaled = 1 / self.vae.config.scaling_factor * latents_target
imgs_target = self.vae.decode(latents_target_scaled.type(self.precision_t)).sample
imgs_target = (imgs_target / 2 + 0.5).clamp(0, 1)
# Compute loss in pixel space using target trick
loss = 0.5 * F.mse_loss(imgs_orig, imgs_target.detach(),
reduction='sum') / latents.shape[0]
return loss
Key Implementation Details
- Gradient Computation: SDS gradient is computed in latent space (same noise prediction and guidance as latent-space SDS)
- Loss Computation: Both current and target latents are decoded to pixel space, and loss is computed in RGB space
- Gradient Flow: Gradients flow back through the VAE decoder to the NeRF model, providing pixel-space supervision
- Target Trick: Uses the same target trick formulation but in pixel space: loss = MSE(decode(latents), decode(latents - grad))
Results
Result from pixel-space SDS loss: "a standing poodle dog"
Why the Implementation Failed
Problem: The pixel-space SDS loss implementation failed to produce any 3D geometry. The rendered output shows only a colored background with no recognizable 3D shape.
Root Cause Analysis:
- Gradient Flow Issue: The critical problem is how gradients flow back through the VAE decoder. In the current implementation, the target image (imgs_target) is computed under torch.no_grad(). Detaching the target is intentional (the target trick requires it), but the loss may still not be properly connected to the NeRF parameters if decoding the current latents does not preserve the computation graph.
- VAE Decoder Precision: The VAE decoder uses self.precision_t for type conversion, which may cause precision issues when gradients flow back. The decoder might not be fully differentiable in the way needed for this application.
- Loss Scale: The pixel-space loss may have a different scale compared to latent-space loss, requiring different learning rates or loss scaling that wasn't tuned.
- Numerical Instability: Decoding latents to pixels and then computing MSE may introduce numerical instabilities, especially when the latents are far from the VAE's training distribution.
- Missing Regularization: The pixel-space implementation may need additional regularization or different hyperparameters compared to latent-space, but these weren't tuned.
What Should Have Been Done:
- Ensure Gradient Flow: Make sure that imgs_orig has gradients enabled and flows back through the VAE decoder to the latents, and then to the NeRF model (see the sanity-check sketch after this list).
- Tune Hyperparameters: The pixel-space loss likely requires different learning rates, loss scales, or regularization weights compared to latent-space.
- Stability Checks: Add gradient clipping or normalization to prevent numerical instabilities from the VAE decoder.
- Progressive Training: Start with latent-space loss to establish geometry, then fine-tune with pixel-space loss.
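As a concrete first step for the gradient-flow item, a small sanity check along these lines could be run on the decoded image (sketch only; sds and latents are the objects from the pixel-space snippet above):

# Sketch: checking that pixel-space supervision can reach the latents (and the NeRF)
latents.retain_grad()                                   # latents is usually non-leaf
latents_scaled = latents / sds.vae.config.scaling_factor
imgs_orig = sds.vae.decode(latents_scaled.type(sds.precision_t)).sample

print("imgs_orig.grad_fn:", imgs_orig.grad_fn)          # None means the graph is already broken
imgs_orig.mean().backward(retain_graph=True)
print("latents.grad is None:", latents.grad is None)    # True means no gradient through the VAE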
Comparison with Latent-Space:
The latent-space SDS loss works because it operates directly in the VAE's latent space, which is designed for the diffusion model. The gradients are more stable and the loss scale is well-calibrated. In contrast, pixel-space loss introduces an additional decoding step that may not preserve the gradient information needed for effective optimization.