16-825 Computer Vision - Assignment 4

3D Gaussian Splatting & Diffusion-Guided Optimization

Name: [Minghao Xu]  |  Andrew ID: [mxu3]

1. 3D Gaussian Splatting (60 Points)

1.1 3D Gaussian Rasterization

Deliverable: GIF rendered from a pre-trained Gaussian model.

I implemented the core 3D Gaussian rasterization pipeline, including projection, alpha/opacity computation, and final blending. A brief sketch of the projection step is included below, followed by the rendered GIF.
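As a reference for the projection step, here is a minimal PyTorch sketch (not the exact assignment code; tensor names, shapes, and the OpenCV-style camera convention are my assumptions). It transforms each Gaussian mean into camera space and maps the 3D covariance to a 2D screen-space covariance through the Jacobian of the perspective projection:

```python
import torch

def project_gaussians(means3d, cov3d, viewmat, K):
    """Project 3D Gaussian means/covariances into 2D screen space.

    Sketch only: means3d (N, 3), cov3d (N, 3, 3), viewmat (4, 4)
    world-to-camera, K (3, 3) intrinsics. Culling and the low-pass
    filter used in the full rasterizer are omitted.
    """
    N = means3d.shape[0]
    ones = torch.ones(N, 1, device=means3d.device)
    means_cam = (viewmat @ torch.cat([means3d, ones], dim=-1).T).T[:, :3]
    x, y, z = means_cam[:, 0], means_cam[:, 1], means_cam[:, 2]

    fx, fy = K[0, 0], K[1, 1]
    # Jacobian of the perspective projection, evaluated per Gaussian.
    J = torch.zeros(N, 2, 3, device=means3d.device)
    J[:, 0, 0] = fx / z
    J[:, 0, 2] = -fx * x / z ** 2
    J[:, 1, 1] = fy / z
    J[:, 1, 2] = -fy * y / z ** 2

    W = viewmat[:3, :3].expand(N, 3, 3)              # camera rotation
    cov2d = J @ W @ cov3d @ W.transpose(1, 2) @ J.transpose(1, 2)

    means2d = torch.stack([fx * x / z + K[0, 2],
                           fy * y / z + K[1, 2]], dim=-1)
    return means2d, cov2d, z                         # z drives depth sorting
```

The 2D covariances define the per-pixel Gaussian weights, which are combined with the learned opacities during alpha blending.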

Color Rendering GIF (Q1.1)

Q1.1 Color Render GIF

Observation: The rendered output confirms that the core 3D Gaussian Splatting pipeline was implemented correctly. The smooth color gradients in the depth map (blue = near, yellow = far) indicate that the 3D Gaussians were projected correctly and that the depth-based sorting is functional. The clean, sharp boundary in the mask/silhouette image confirms that the opacity calculation and the volumetric accumulation (transmittance and final color blending) are executed correctly.
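For completeness, the accumulation referred to above is the standard front-to-back compositing over depth-sorted Gaussians (the notation here is mine, not from the handout):

$$
C(\mathbf{p}) = \sum_{i=1}^{N} c_i\,\alpha_i\,T_i,
\qquad
T_i = \prod_{j=1}^{i-1}\bigl(1-\alpha_j\bigr),
\qquad
\alpha_i = o_i \exp\!\Bigl(-\tfrac{1}{2}\,(\mathbf{p}-\boldsymbol{\mu}_i')^{\top}\,\Sigma_i'^{-1}\,(\mathbf{p}-\boldsymbol{\mu}_i')\Bigr)
$$

where $o_i$ is the learned opacity and $\boldsymbol{\mu}_i', \Sigma_i'$ are the projected 2D mean and covariance of Gaussian $i$ at pixel $\mathbf{p}$.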

1.2 Training 3D Gaussian Representations

Deliverable: GIF showing the final rendered toy truck after training.

I trained the 3D Gaussian representation of the toy truck using isotropic Gaussians initialized from a point cloud. Training ran for 1000 iterations with a **different learning rate for each parameter group**, which gave fast and stable convergence.
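A minimal sketch of the optimizer setup with one learning rate per parameter group (the tensors below are stand-ins for the parameters created from the point cloud; shapes follow the isotropic setup, and the actual starter code may wrap them differently):

```python
import torch

# Stand-ins for the per-Gaussian parameters initialized from the point cloud.
N = 10_000
means     = torch.randn(N, 3, requires_grad=True)
scales    = torch.zeros(N, 1, requires_grad=True)   # isotropic: one scale each
colours   = torch.rand(N, 3, requires_grad=True)
opacities = torch.zeros(N, 1, requires_grad=True)

# One learning rate per parameter group (values as in the table below).
optimizer = torch.optim.Adam([
    {"params": [opacities], "lr": 0.001},
    {"params": [scales],    "lr": 0.003},
    {"params": [colours],   "lr": 0.02},
    {"params": [means],     "lr": 0.01},
])
```

Each iteration then renders a random training view, computes the image loss against the ground-truth photo, and calls `loss.backward()` and `optimizer.step()` as usual.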

Training Parameters & Metrics

| Parameter | Learning rate |
| --- | --- |
| opacities | 0.001 |
| scales | 0.003 |
| colours | 0.02 |
| means | 0.01 |

Trained Iterations: 1000
Mean PSNR: 29.811
Mean SSIM: 0.939

Rendered Results (GIFs)

Q1.2 Truck Final GIF

Final Render GIF

Q1.2 Truck Training GIF

Training Progress GIF

1.3.1 Rendering Using Spherical Harmonics (SH)

Deliverables:

  1. Attach the GIF you obtained using render.py for questions 1.3.1 (SH Rendering) and 1.1.5 (Base Rendering).
  2. Attach 2 or 3 side-by-side RGB image comparisons of the renderings obtained from both cases. The images being compared must correspond to the same view/frame.

I extended the base 3D Gaussian rasterizer from Q1.1.5 to incorporate **Spherical Harmonics (SH)** for the color contribution of each Gaussian. This change models **view-dependent lighting effects** (such as highlights and reflections), significantly enhancing realism beyond the view-independent fixed color model.
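A degree-1 sketch of the evaluation (the constants follow the real-SH convention common to splatting implementations; higher degrees add more basis terms in the same pattern, and the exact coefficient layout in the assignment code may differ):

```python
import torch

C0 = 0.28209479177387814   # degree-0 real SH constant
C1 = 0.4886025119029199    # degree-1 real SH constant

def sh_to_rgb(sh_coeffs, view_dirs):
    """Evaluate view-dependent color from degree-1 SH coefficients.

    sh_coeffs: (N, 4, 3) -- 4 SH bases, one RGB triplet each.
    view_dirs: (N, 3)    -- unit vectors from the camera to each Gaussian.
    """
    x, y, z = view_dirs[:, 0:1], view_dirs[:, 1:2], view_dirs[:, 2:3]
    rgb = (C0 * sh_coeffs[:, 0]
           - C1 * y * sh_coeffs[:, 1]
           + C1 * z * sh_coeffs[:, 2]
           - C1 * x * sh_coeffs[:, 3])
    # Shift so a zero DC term maps to mid-gray, then clamp to valid colors.
    return (rgb + 0.5).clamp(0.0, 1.0)
```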

GIF Comparison: View-Independent vs. View-Dependent Color

Q1.1.5 Base Rendering (View-Independent Color)
Q1.1.5 Base Color Render GIF
Q1.3.1 SH Rendering (View-Dependent Color)
Q1.3.1 SH Color Render GIF

Observation (GIF Summary): The Q1.3.1 GIF demonstrates a clear jump in fidelity over the Q1.1.5 base render. While both GIFs show the same geometry (depth and silhouette are preserved), the SH render exhibits **dynamic specular highlights** and **subtle, smooth shading transitions** as the viewpoint changes. The base render, by contrast, appears uniformly lit and flat, validating that the integration of higher-order SH coefficients successfully simulates view-dependent reflection and lighting.


Static Image Comparisons: Identical Viewpoints

Comparison 1: Front View
Q1.1.5 Base Render
Q1.1.5 Base Render Frame 1
Q1.3.1 SH Render
Q1.3.1 SH Render Frame 1

Comparison 2: Side View
Q1.1.5 Base Render
Q1.1.5 Base Render Frame 2
Q1.3.1 SH Render
Q1.3.1 SH Render Frame 2

Comparison 3: Top-Back View
Q1.1.5 Base Render
Q1.1.5 Base Render Frame 3
Q1.3.1 SH Render
Q1.3.1 SH Render Frame 3

2. Diffusion-Guided Optimization (60 Points)

2.1 SDS Loss and Image Optimization

Deliverable: Four optimized images showing the effect of Classifier-Free Guidance (CFG).

I implemented the SDS loss and compare the results of optimizing a latent with and without classifier-free guidance (guidance scale > 1).
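A sketch of the guided SDS gradient computation (assuming a Stable-Diffusion-style UNet `unet(x_t, t, emb)` that predicts noise and a precomputed `alphas_cumprod` schedule; names and the guidance scale are illustrative):

```python
import torch

def sds_grad(unet, latents, text_emb, uncond_emb, alphas_cumprod,
             guidance_scale=100.0):
    """One SDS step: the gradient pushed onto the optimized latents."""
    t = torch.randint(20, 980, (1,), device=latents.device)
    noise = torch.randn_like(latents)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy = a_t.sqrt() * latents + (1 - a_t).sqrt() * noise

    with torch.no_grad():                      # no backprop through the UNet
        eps_text = unet(noisy, t, text_emb)
        eps_uncond = unet(noisy, t, uncond_emb)

    # Classifier-free guidance: move away from the unconditional prediction
    # and toward the text-conditioned one.
    eps = eps_uncond + guidance_scale * (eps_text - eps_uncond)

    w = 1 - a_t                                # weighting term w(t)
    return torch.nan_to_num(w * (eps - noise))
```

In practice this gradient is injected directly with `latents.backward(gradient=grad)` (or an equivalent custom autograd function) rather than being derived from a scalar loss.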

"a hamburger" (No Guided, iter 2000)

Hamburger No Guide

"a hamburger" (Guidance,iter 1900)

Hamburger With Guide

"a standing corgi dog" (No Guided, iter 2000)

a_standing_corgi_dog No Guide

"a standing corgi dog" (Guidance,iter 2000)

a_standing_corgi_dog With Guide

"I am whipping my computer" (No Guided, iter 2000)

I_am_whipping_my_computer No Guide

"I am whipping my computer" (Guidance,iter 2000)

I_am_whipping_my_computer With Guide

"I punch and shatter the CMU logo" (No Guided, iter 2000)

I punch and shatter the CMU logo No Guide

"I punch and shatter the CMU logo" (Guidance,iter 2000, not ideal)

I punch and shatter the CMU logo With Guide

"A fist at the center of an exploding, shattered CMU logo, fragments and shards flying everywhere" (Guidance,iter 2000, better)

I punch and shatter the CMU logo With Guide

2.2 Texture Map Optimization for Mesh

Deliverable: Two GIFs showing the final textured mesh views.

I optimized the texture map of the provided cow mesh using the SDS loss, demonstrating text-guided texture generation on a fixed geometry.
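The optimization loop looks roughly like the sketch below; `sample_random_camera`, `render_mesh`, and `encode_to_latent` are hypothetical stand-ins for the assignment's camera sampler, differentiable mesh renderer, and VAE encoder, and `sds_grad` is the helper sketched in Q2.1:

```python
import torch

# Learnable texture map (resolution chosen for illustration).
texture = torch.nn.Parameter(torch.rand(1, 512, 512, 3))
optimizer = torch.optim.Adam([texture], lr=0.01)

for it in range(2000):
    camera = sample_random_camera()                  # random viewpoint each step
    rgb = render_mesh(cow_mesh, texture, camera)     # differentiable mesh render
    latents = encode_to_latent(rgb)                  # VAE encode to latent space
    grad = sds_grad(unet, latents, text_emb, uncond_emb, alphas_cumprod)

    optimizer.zero_grad()
    latents.backward(gradient=grad)                  # inject the SDS gradient
    optimizer.step()
    texture.data.clamp_(0.0, 1.0)                    # keep colors in range
```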

Prompt A: "a cow with a hamburger texture"

Burger Cow GIF

Prompt B: "a cow covered in a reflective disco ball mirror texture"

Armor GIF

2.3 NeRF Optimization (View-Independent)

Deliverable: Three pairs of RGB and Depth videos for different prompts. (Displayed below as GIFs)

Prompt 1: "a standing corgi dog" (RGB)

Corgi RGB GIF

Prompt 1: "a standing corgi dog" (Depth)

Corgi Depth GIF

Prompt 2: "a cute little pig" (RGB)

a cute little pig RGB GIF

Prompt 2: "a cute little pig" (Depth)

a cute little pig Depth GIF

Prompt 3: "a tuna fish" (RGB)

tuna fish RGB GIF

Prompt 3: "a tuna fish" (Depth)

tuna fish Depth GIF

2.4.1 View-Dependent Text Embedding (10 Points)

Deliverable: RGB and Depth videos with view-dependent conditioning compared with Q2.3 results.

The DreamFusion paper proposes view-dependent text embedding to achieve better 3D consistency by conditioning the diffusion model on the viewing angle. This addresses issues like multiple front-facing features (e.g., overlapping ears) that occur when optimizing each view independently.
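A sketch of the prompt augmentation (the azimuth thresholds are illustrative, and DreamFusion additionally uses elevation for an "overhead view" case, omitted here):

```python
def view_dependent_prompt(base_prompt: str, azimuth_deg: float) -> str:
    """Append a view word to the prompt based on the sampled camera azimuth."""
    azimuth_deg = azimuth_deg % 360
    if azimuth_deg < 45 or azimuth_deg >= 315:
        view = "front view"
    elif azimuth_deg < 135:
        view = "side view"
    elif azimuth_deg < 225:
        view = "back view"
    else:
        view = "side view"
    return f"{base_prompt}, {view}"

# The text embedding is recomputed (or picked from a small cache) for each
# randomly sampled camera before the SDS step, e.g.:
# view_dependent_prompt("a standing corgi dog", 170) -> "a standing corgi dog, back view"
```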

Comparison 1: "a standing corgi dog"

Q2.3: View-Independent

RGB

Q2.3 Corgi RGB

Depth

Q2.3 Corgi Depth
Q2.4.1: View-Dependent

RGB

Q2.4.1 Corgi RGB

Depth

Q2.4.1 Corgi Depth

Observation: Enabling view-dependent (VD) embedding effectively resolved the issue of multiple overlapping ears observed in the Q2.3 corgi model. The VD version shows more coherent geometry and a cleaner, more realistic silhouette as the viewing angle changes. The depth map also appears more consistent and uniform across different views.

Comparison 2: "a tuna fish"

Q2.3: View-Independent

RGB

Q2.3 Tuna RGB

Depth

Q2.3 Tuna Depth
Q2.4.1: View-Dependent

RGB

Q2.4.1 Tuna RGB

Depth

Q2.4.1 Tuna Depth

Observation - Challenging Geometry:

The tuna fish exhibits the most severe multi-view inconsistency artifacts among all tested objects, with duplicate tail fins visible from opposite viewing angles. This occurs even with view-dependent conditioning and highlights fundamental challenges in text-to-3D generation for objects with certain geometric properties.

Why Tuna Fish is Particularly Difficult:

  • Strong Directionality Without Ground Constraint: Unlike the corgi (which has four legs providing ground contact points), the tuna is a free-floating object with extreme head-to-tail directionality. This makes it ambiguous which end is "front" when viewed from certain angles, leading the model to generate recognizable features (tail fins) at multiple positions.
  • Concentrated Salient Features: The tuna's most distinctive feature (the forked tail fin) is concentrated at one end (~20% of body length), similar to the corgi's ears. The diffusion model strongly associates "tuna fish" with visible tail fins, causing it to hallucinate tails from every viewpoint to satisfy the text prompt.
  • Bilateral Symmetry Ambiguity: The tuna's perfect left-right symmetry means there are two equally valid 180° rotations. Without strong 3D consistency enforcement, the optimization can converge to a solution where opposite sides are treated as independent "front views," each with its own tail.
  • Smooth Featureless Body: The streamlined body (~70% of length) lacks distinctive texture or geometric features that could serve as 3D anchors. This gives the optimizer too much freedom to place high-saliency features (tails) wherever needed to minimize 2D reconstruction loss per view.
  • Training Data Bias: Most tuna images in diffusion model training data show side profiles (exhibiting the characteristic streamlined shape). Views from top, bottom, or oblique angles are underrepresented, providing weak priors for these orientations and allowing the model to "fill in" with duplicate features.

Comparison with Other Objects:

  • vs. Corgi: While both show feature duplication, the corgi's four-legged stance provides a ground-plane constraint and a clearer front/back distinction, making VD conditioning more effective.
  • Overall: This example demonstrates the limits of SDS optimization for highly directional, free-floating objects even with view-dependent conditioning, and suggests that additional geometric constraints or progressive training strategies may be necessary for such cases.

Summary: Geometric Complexity and 3D Consistency

Key Findings Across Different Object Types:

  • Symmetric objects show minimal artifacts: Objects with simple, symmetric geometry (like spheres) naturally avoid multi-view inconsistencies because every view is equally valid and contains similar features. These objects require minimal or no view-dependent conditioning.
  • Complex asymmetric objects benefit significantly from VD conditioning: Objects with distinct directional features (like the corgi) show clear improvement with view-dependent embeddings, as the conditioning helps establish consistent front/back/side orientations and reduces feature duplication artifacts.
  • Highly directional free-floating objects are most challenging: The tuna fish represents the most difficult case—combining strong directionality, concentrated salient features, bilateral symmetry, and lack of ground constraints. Even view-dependent conditioning struggles with these cases, suggesting the need for additional regularization techniques or multi-stage optimization strategies.
  • Trade-off consideration: While view-dependent embeddings improve geometric consistency for complex objects, they add computational overhead and may still be insufficient for objects with inherent geometric ambiguities. The technique should be selectively applied based on object complexity and geometry type.
  • Geometric difficulty hierarchy: Based on our experiments, the difficulty of achieving 3D consistency follows this pattern:
    Sphere (easiest) < Radially symmetric (flower) < Grounded asymmetric (corgi) < Free-floating directional (tuna) < Complex articulated objects (hardest)

Future Directions: For challenging objects like the tuna fish, potential improvements could include (1) progressive training with directional constraints, (2) density regularization to prevent feature duplication, (3) explicit symmetry-breaking losses, or (4) incorporating shape priors from 3D model datasets.