16-825: Learning for 3D Vision — Assignment 4

Manyung Emma Hon · mehon · Fall 2025

1.1.5 Perform Splatting

[Splatting render]
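For reference, here is a minimal sketch of the front-to-back alpha compositing that splatting performs, assuming the 3D Gaussians have already been projected to 2D means and covariances; the tensor names and layout are illustrative assumptions, not the assignment's actual interface.

```python
import torch

def composite_splats(means2d, inv_covs2d, opacities, colors, depths, H, W):
    """Front-to-back alpha compositing of projected 2D Gaussians.

    means2d:    (N, 2) pixel-space centers
    inv_covs2d: (N, 2, 2) inverses of the projected 2D covariances
    opacities:  (N, 1) per-Gaussian opacity in [0, 1]
    colors:     (N, 3) per-Gaussian RGB
    depths:     (N,)   camera-space depths, used only for sorting
    """
    # Sort Gaussians front to back so nearer splats occlude farther ones.
    order = torch.argsort(depths)
    means2d, inv_covs2d = means2d[order], inv_covs2d[order]
    opacities, colors = opacities[order], colors[order]

    # Flattened pixel grid of (x, y) coordinates: (H*W, 2).
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys], dim=-1).reshape(-1, 2).float()

    image = torch.zeros(H * W, 3)
    transmittance = torch.ones(H * W, 1)

    for i in range(means2d.shape[0]):
        # Gaussian falloff from the Mahalanobis distance to the 2D center.
        d = pix - means2d[i]
        maha = torch.einsum("pi,ij,pj->p", d, inv_covs2d[i], d)
        alpha = (opacities[i] * torch.exp(-0.5 * maha)).unsqueeze(-1)
        # Standard over-compositing: accumulate color, attenuate transmittance.
        image = image + transmittance * alpha * colors[i]
        transmittance = transmittance * (1.0 - alpha)

    return image.reshape(H, W, 3)
```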

1.2.2 Perform Forward Pass and Compute Loss

Learning rates used (see the optimizer sketch below)
  • Opacities: 0.01
  • Scales: 0.005
  • Colours: 0.01
  • Means: 0.00016

Training setup and results
  • Number of iterations: 1000
  • PSNR: 29.084
  • SSIM: 0.936
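Below is a minimal sketch of how these per-group learning rates can be wired into a single Adam optimizer using parameter groups; the parameter names and shapes are assumptions for illustration, not the starter code's actual variables.

```python
import torch

# Hypothetical learnable Gaussian parameters; names and shapes are assumptions.
N = 10_000
opacities = torch.zeros(N, 1, requires_grad=True)
scales    = torch.zeros(N, 3, requires_grad=True)
colours   = torch.rand(N, 3, requires_grad=True)
means     = torch.randn(N, 3, requires_grad=True)

# One Adam optimizer with a separate learning rate per parameter group,
# matching the values listed above.
optimizer = torch.optim.Adam([
    {"params": [opacities], "lr": 0.01},
    {"params": [scales],    "lr": 0.005},
    {"params": [colours],   "lr": 0.01},
    {"params": [means],     "lr": 0.00016},
])
```

Each iteration then renders the current Gaussians, computes the image loss against the ground-truth view, and calls loss.backward() followed by optimizer.step().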
Final renders after training

[Final renders]

1.3.1 Rendering Using Spherical Harmonics (10 points)

Without spherical harmonics
[render]

With spherical harmonics
[render]
Frame 16 (with SH / without SH): Instead of a single flat shade, the chair rendered with spherical harmonics varies smoothly in shading across its surface.
Frame 17 (with SH / without SH): The shading also changes with the viewing angle.
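For reference, a minimal sketch of evaluating degree-1 spherical harmonics into view-dependent RGB, following the constants and sign convention used in the original 3D Gaussian Splatting reference code; the coefficient layout below is an assumption.

```python
import torch

SH_C0 = 0.28209479177387814  # 1 / (2 * sqrt(pi))
SH_C1 = 0.4886025119029199   # sqrt(3) / (2 * sqrt(pi))

def sh_to_rgb(sh_coeffs, view_dirs):
    """Evaluate degree-1 real spherical harmonics to view-dependent color.

    sh_coeffs: (N, 4, 3) per-Gaussian coefficients (1 DC + 3 linear, per channel)
    view_dirs: (N, 3)    unit view directions from the camera to each Gaussian
    """
    x, y, z = view_dirs[:, 0:1], view_dirs[:, 1:2], view_dirs[:, 2:3]
    rgb = (SH_C0 * sh_coeffs[:, 0]
           - SH_C1 * y * sh_coeffs[:, 1]
           + SH_C1 * z * sh_coeffs[:, 2]
           - SH_C1 * x * sh_coeffs[:, 3])
    # Shift by 0.5 so zero coefficients map to mid-gray, then clamp to [0, 1].
    return torch.clamp(rgb + 0.5, 0.0, 1.0)
```

With only the DC term, every view direction returns the same color; the degree-1 terms are what let the shading vary with the viewing angle, as in the comparison above.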

    2.1 SDS Loss + Image Optimization (20 points)

All images were trained for 2000 iterations.

    Without guidance, the optimization struggles to produce recognizable objects, while with guidance the objects match the text prompts much more accurately.
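To make the role of guidance concrete, here is a minimal sketch of one SDS update with classifier-free guidance, written against a diffusers-style UNet and noise scheduler; the names, the timestep range, and the weighting w(t) = 1 - alpha_bar(t) are common choices and assumptions, not the assignment's exact interface.

```python
import torch

def sds_loss(latents, text_emb, uncond_emb, unet, scheduler, guidance_scale=100.0):
    """One Score Distillation Sampling step with classifier-free guidance."""
    # Sample a random diffusion timestep and noise the current latents.
    t = torch.randint(20, 980, (1,), device=latents.device)
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, t)

    with torch.no_grad():
        # Predict the noise with and without the text condition.
        eps_text = unet(noisy, t, encoder_hidden_states=text_emb).sample
        eps_uncond = unet(noisy, t, encoder_hidden_states=uncond_emb).sample

    # Classifier-free guidance: extrapolate toward the text-conditioned score.
    eps = eps_uncond + guidance_scale * (eps_text - eps_uncond)

    # w(t) = 1 - alpha_bar(t), a common SDS weighting.
    w = (1.0 - scheduler.alphas_cumprod.to(latents.device)[t]).view(-1, 1, 1, 1)
    grad = w * (eps - noise)

    # Surrogate loss whose gradient w.r.t. the latents equals `grad`.
    return (grad.detach() * latents).sum()
```

In this sketch, setting guidance_scale to 1 reduces to the plain text-conditioned prediction, which is roughly what the "without guidance" runs correspond to.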

    Prompt 1: "a hamburger"

    hamburger without guidance
    Without Guidance
    hamburger with guidance
    With Guidance

    Prompt 2: "a standing corgi dog"

    corgi without guidance
    Without Guidance
    corgi with guidance
    With Guidance

    Prompt 3: "a dancing cat"

    cat without guidance
    Without Guidance
    cat with guidance
    With Guidance

    Prompt 4: "a rabbit eating apple"

    rabbit without guidance
    Without Guidance
    rabbit with guidance
    With Guidance

    2.2 Texture Map Optimization for Mesh (15 points)

Prompt 1: "a dotted black and white cow"

[Textured cow render]

Prompt 2: "a stripe brown and green cow"

[Textured cow render]
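This works like 2.1, but with the texture map as the only learnable quantity. Below is a minimal sketch of the loop, assuming a differentiable mesh renderer and the SDS update sketched above; sample_random_viewpoint, render_textured_mesh, and the other names here are hypothetical stand-ins.

```python
import torch

# Hypothetical setup: the cow mesh is fixed; only the texture image is learnable.
texture = torch.nn.Parameter(torch.rand(1, 512, 512, 3))
optimizer = torch.optim.Adam([texture], lr=0.01)  # lr is an assumption

for step in range(2000):
    camera = sample_random_viewpoint()                   # hypothetical helper
    image = render_textured_mesh(mesh, texture, camera)  # e.g., PyTorch3D
    # For a latent diffusion model, `image` would first be encoded to latents.
    loss = sds_loss(image, text_emb, uncond_emb, unet, scheduler)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```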

    2.3 NeRF Optimization (15 points)

    Prompt 1: "a standing corgi dog"

    Depth
    RGB

    Prompt 2: "a dancing rabbit"

    Depth
    RGB

    Prompt 3: "a sitting cat"

    Depth
    RGB

    2.4.1 [Extension] View-dependent text embedding (10 points)

    This extension implements view-dependent text conditioning. By conditioning the diffusion model on viewing direction (front, side, back, overhead), the optimization produces more 3D-consistent results.
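The view buckets can be implemented as a simple lookup on the camera's azimuth and elevation before encoding the prompt; the angle thresholds below are assumptions for illustration.

```python
def view_dependent_prompt(base_prompt, azimuth_deg, elevation_deg):
    """Append a view word to the prompt based on the camera pose.

    Azimuth 0 is taken as the front view; thresholds are assumptions.
    """
    if elevation_deg > 60.0:
        view = "overhead view"
    else:
        az = azimuth_deg % 360.0
        if az < 45.0 or az >= 315.0:
            view = "front view"
        elif 135.0 <= az < 225.0:
            view = "back view"
        else:
            view = "side view"
    return f"{base_prompt}, {view}"

# view_dependent_prompt("a standing corgi dog", 180.0, 20.0)
# -> "a standing corgi dog, back view"
```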

    Prompt 1: "a standing corgi dog"

    Without View-Dependent Conditioning (from Q2.3)

    Depth (Baseline)
    RGB (Baseline)

    With View-Dependent Conditioning

    Depth (View-Dependent)
    RGB (View-Dependent)

    Prompt 2: "a sitting cat"

    Without View-Dependent Conditioning (from Q2.3)

    Depth (Baseline)
    RGB (Baseline)

    With View-Dependent Conditioning

    Depth (View-Dependent)
    RGB (View-Dependent)

    Qualitative Analysis

3D consistency has improved. View-dependent conditioning helps the model understand spatial relationships, resulting in animals with the correct number of features (e.g., two ears instead of multiple faces). Additionally, objects maintain a consistent 3D structure as the camera orbits, with fewer artifacts such as floating geometry or duplicated features. Lastly, the appearance changes smoothly between viewing angles rather than showing jarring inconsistencies. While view-dependent conditioning significantly improved 3D consistency and geometric quality, and the objects look more believable from all angles, it sometimes struggles with fine color details (e.g., corgi eye placement) and may require more careful hyperparameter tuning to achieve full color accuracy.

    Overall, view-dependent text conditioning is crucial for creating convincing 3D assets from text descriptions. While it may add complexity, the improvement in 3D consistency far outweighs the drawbacks.