16-825: Learning for 3D Vision — Assignment 4

Manyung Emma Hon · mehon · Fall 2025

1.1.5 Perform Splatting

[Splatting render]
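For reference, here is a minimal sketch of the front-to-back alpha compositing that splatting performs, assuming the 3D Gaussians have already been projected to 2D means and covariances; the tensor names and layout are illustrative assumptions, not the assignment's actual interface.

```python
import torch

def composite_splats(means2d, inv_covs2d, opacities, colors, depths, H, W):
    """Front-to-back alpha compositing of projected 2D Gaussians.

    means2d:    (N, 2) pixel-space centers
    inv_covs2d: (N, 2, 2) inverses of the projected 2D covariances
    opacities:  (N, 1) per-Gaussian opacity in [0, 1]
    colors:     (N, 3) per-Gaussian RGB
    depths:     (N,)   camera-space depths, used only for sorting
    """
    # Sort Gaussians front to back so nearer splats occlude farther ones.
    order = torch.argsort(depths)
    means2d, inv_covs2d = means2d[order], inv_covs2d[order]
    opacities, colors = opacities[order], colors[order]

    # Flattened pixel grid of (x, y) coordinates: (H*W, 2).
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys], dim=-1).reshape(-1, 2).float()

    image = torch.zeros(H * W, 3)
    transmittance = torch.ones(H * W, 1)

    for i in range(means2d.shape[0]):
        # Gaussian falloff from the Mahalanobis distance to the 2D center.
        d = pix - means2d[i]
        maha = torch.einsum("pi,ij,pj->p", d, inv_covs2d[i], d)
        alpha = (opacities[i] * torch.exp(-0.5 * maha)).unsqueeze(-1)
        # Standard over-compositing: accumulate color, attenuate transmittance.
        image = image + transmittance * alpha * colors[i]
        transmittance = transmittance * (1.0 - alpha)

    return image.reshape(H, W, 3)
```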

1.2.2 Perform Forward Pass and Compute Loss

Learning rates used (see the optimizer sketch below)
  • Opacities: 0.01
  • Scales: 0.005
  • Colours: 0.01
  • Means: 0.00016

Training setup and results
  • Number of iterations: 1000
  • PSNR: 29.084
  • SSIM: 0.936
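Below is a minimal sketch of how these per-group learning rates can be wired into a single Adam optimizer using parameter groups; the parameter names and shapes are assumptions for illustration, not the starter code's actual variables.

```python
import torch

# Hypothetical learnable Gaussian parameters; names and shapes are assumptions.
N = 10_000
opacities = torch.zeros(N, 1, requires_grad=True)
scales    = torch.zeros(N, 3, requires_grad=True)
colours   = torch.rand(N, 3, requires_grad=True)
means     = torch.randn(N, 3, requires_grad=True)

# One Adam optimizer with a separate learning rate per parameter group,
# matching the values listed above.
optimizer = torch.optim.Adam([
    {"params": [opacities], "lr": 0.01},
    {"params": [scales],    "lr": 0.005},
    {"params": [colours],   "lr": 0.01},
    {"params": [means],     "lr": 0.00016},
])
```

Each iteration then renders the current Gaussians, computes the image loss against the ground-truth view, and calls loss.backward() followed by optimizer.step().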
Final renders after training

[Final renders]

1.3.1 Rendering Using Spherical Harmonics (10 points)

Without spherical harmonics
[render]

With spherical harmonics
[render]
Frame 16 (with SH / without SH): Instead of a single flat shade, the chair rendered with spherical harmonics varies smoothly in shading across its surface.
Frame 17 (with SH / without SH): The shading also changes with the viewing angle.
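For reference, a minimal sketch of evaluating degree-1 spherical harmonics into view-dependent RGB, following the constants and sign convention used in the original 3D Gaussian Splatting reference code; the coefficient layout below is an assumption.

```python
import torch

SH_C0 = 0.28209479177387814  # 1 / (2 * sqrt(pi))
SH_C1 = 0.4886025119029199   # sqrt(3) / (2 * sqrt(pi))

def sh_to_rgb(sh_coeffs, view_dirs):
    """Evaluate degree-1 real spherical harmonics to view-dependent color.

    sh_coeffs: (N, 4, 3) per-Gaussian coefficients (1 DC + 3 linear, per channel)
    view_dirs: (N, 3)    unit view directions from the camera to each Gaussian
    """
    x, y, z = view_dirs[:, 0:1], view_dirs[:, 1:2], view_dirs[:, 2:3]
    rgb = (SH_C0 * sh_coeffs[:, 0]
           - SH_C1 * y * sh_coeffs[:, 1]
           + SH_C1 * z * sh_coeffs[:, 2]
           - SH_C1 * x * sh_coeffs[:, 3])
    # Shift by 0.5 so zero coefficients map to mid-gray, then clamp to [0, 1].
    return torch.clamp(rgb + 0.5, 0.0, 1.0)
```

With only the DC term, every view direction returns the same color; the degree-1 terms are what let the shading vary with the viewing angle, as in the comparison above.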

    2.1 SDS Loss + Image Optimization (20 points)

All images were trained for 2000 iterations.

    Without guidance, the optimization struggles to produce recognizable objects, while with guidance the objects match the text prompts much more accurately.
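To make the role of guidance concrete, here is a minimal sketch of one SDS update with classifier-free guidance, written against a diffusers-style UNet and noise scheduler; the names, the timestep range, and the weighting w(t) = 1 - alpha_bar(t) are common choices and assumptions, not the assignment's exact interface.

```python
import torch

def sds_loss(latents, text_emb, uncond_emb, unet, scheduler, guidance_scale=100.0):
    """One Score Distillation Sampling step with classifier-free guidance."""
    # Sample a random diffusion timestep and noise the current latents.
    t = torch.randint(20, 980, (1,), device=latents.device)
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, t)

    with torch.no_grad():
        # Predict the noise with and without the text condition.
        eps_text = unet(noisy, t, encoder_hidden_states=text_emb).sample
        eps_uncond = unet(noisy, t, encoder_hidden_states=uncond_emb).sample

    # Classifier-free guidance: extrapolate toward the text-conditioned score.
    eps = eps_uncond + guidance_scale * (eps_text - eps_uncond)

    # w(t) = 1 - alpha_bar(t), a common SDS weighting.
    w = (1.0 - scheduler.alphas_cumprod.to(latents.device)[t]).view(-1, 1, 1, 1)
    grad = w * (eps - noise)

    # Surrogate loss whose gradient w.r.t. the latents equals `grad`.
    return (grad.detach() * latents).sum()
```

In this sketch, setting guidance_scale to 1 reduces to the plain text-conditioned prediction, which is roughly what the "without guidance" runs correspond to.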

    Prompt 1: "a hamburger"

    hamburger without guidance
    Without Guidance
    hamburger with guidance
    With Guidance

    Prompt 2: "a standing corgi dog"

    corgi without guidance
    Without Guidance
    corgi with guidance
    With Guidance

    Prompt 3: "a dancing cat"

    cat without guidance
    Without Guidance
    cat with guidance
    With Guidance

    Prompt 4: "a rabbit eating apple"

    rabbit without guidance
    Without Guidance
    rabbit with guidance
    With Guidance

    2.2 Texture Map Optimization for Mesh (15 points)

Prompt 1: "a dotted black and white cow"

[Textured cow render]

Prompt 2: "a stripe brown and green cow"

[Textured cow render]
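This works like 2.1, but with the texture map as the only learnable quantity. Below is a minimal sketch of the loop, assuming a differentiable mesh renderer and the SDS update sketched above; sample_random_viewpoint, render_textured_mesh, and the other names here are hypothetical stand-ins.

```python
import torch

# Hypothetical setup: the cow mesh is fixed; only the texture image is learnable.
texture = torch.nn.Parameter(torch.rand(1, 512, 512, 3))
optimizer = torch.optim.Adam([texture], lr=0.01)  # lr is an assumption

for step in range(2000):
    camera = sample_random_viewpoint()                   # hypothetical helper
    image = render_textured_mesh(mesh, texture, camera)  # e.g., PyTorch3D
    # For a latent diffusion model, `image` would first be encoded to latents.
    loss = sds_loss(image, text_emb, uncond_emb, unet, scheduler)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```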

    2.3 NeRF Optimization (15 points)

    Prompt 1: "a standing corgi dog"

    Depth
    RGB

    Prompt 2: "a dancing rabbit"

    Depth
    RGB

    Prompt 3: "a sitting cat"

    Depth
    RGB

    2.4.1 [Extension] View-dependent text embedding (10 points)

    This extension implements view-dependent text conditioning. By conditioning the diffusion model on viewing direction (front, side, back, overhead), the optimization produces more 3D-consistent results.
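The view buckets can be implemented as a simple lookup on the camera's azimuth and elevation before encoding the prompt; the angle thresholds below are assumptions for illustration.

```python
def view_dependent_prompt(base_prompt, azimuth_deg, elevation_deg):
    """Append a view word to the prompt based on the camera pose.

    Azimuth 0 is taken as the front view; thresholds are assumptions.
    """
    if elevation_deg > 60.0:
        view = "overhead view"
    else:
        az = azimuth_deg % 360.0
        if az < 45.0 or az >= 315.0:
            view = "front view"
        elif 135.0 <= az < 225.0:
            view = "back view"
        else:
            view = "side view"
    return f"{base_prompt}, {view}"

# view_dependent_prompt("a standing corgi dog", 180.0, 20.0)
# -> "a standing corgi dog, back view"
```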

    Prompt 1: "a standing corgi dog"

    Without View-Dependent Conditioning (from Q2.3)

    Depth (Baseline)
    RGB (Baseline)

    With View-Dependent Conditioning

    Depth (View-Dependent)
    RGB (View-Dependent)

    Prompt 2: "a sitting cat"

    Without View-Dependent Conditioning (from Q2.3)

    Depth (Baseline)
    RGB (Baseline)

    With View-Dependent Conditioning

    Depth (View-Dependent)
    RGB (View-Dependent)

    Qualitative Analysis

3D consistency has improved. View-dependent conditioning helps the model understand spatial relationships, resulting in animals with the correct number of features (e.g., two ears instead of multiple faces). Additionally, objects maintain a consistent 3D structure as the camera orbits, with fewer artifacts such as floating geometry or duplicated features. Lastly, the appearance changes smoothly between viewing angles rather than showing jarring inconsistencies. While view-dependent conditioning significantly improved 3D consistency and geometric quality, and the objects look more believable from all angles, it sometimes struggles with fine color details (e.g., corgi eye placement) and may require more careful hyperparameter tuning to achieve full color accuracy.

    Overall, view-dependent text conditioning is crucial for creating convincing 3D assets from text descriptions. While it may add complexity, the improvement in 3D consistency far outweighs the drawbacks.