16-825 Learning for 3D Vision

Assignment 4

Rodrigo Lopes Catto | rlopesca

3D Gaussian Splatting

1.1.5 Perform Splatting

[GIF: splatted rendering of the scene]
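For reference, a minimal sketch of the compositing step, assuming the Gaussians have already been projected to 2D, depth-sorted, and evaluated into per-pixel opacities (the function and tensor names are my own, not the starter code's):

```python
import torch

def composite(alphas, colors):
    """Front-to-back alpha compositing of depth-sorted Gaussians at each pixel.

    alphas: (N, P) opacity of each of N depth-sorted Gaussians at P pixels
    colors: (N, P, 3) RGB contribution of each Gaussian at each pixel
    """
    # Transmittance in front of each Gaussian: product of (1 - alpha)
    # over all Gaussians closer to the camera.
    T = torch.cumprod(1.0 - alphas + 1e-10, dim=0)
    T = torch.cat([torch.ones_like(T[:1]), T[:-1]], dim=0)  # nearest layer is unoccluded
    weights = (alphas * T).unsqueeze(-1)   # (N, P, 1)
    return (weights * colors).sum(dim=0)   # (P, 3) composited colours
```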

1.2 Training 3D Gaussian Representations

Number of iterations: 250

Learning rate parameters:
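As a sketch, per-attribute learning rates can be set with Adam parameter groups; the attribute names and rates below are placeholders, not the values used for this run:

```python
import torch
from types import SimpleNamespace

# `gaussians` stands in for the learnable attribute tensors (assumed names).
N = 10_000
gaussians = SimpleNamespace(
    means=torch.zeros(N, 3, requires_grad=True),
    log_scales=torch.zeros(N, 3, requires_grad=True),
    quats=torch.zeros(N, 4, requires_grad=True),
    opacities=torch.zeros(N, 1, requires_grad=True),
    colours=torch.zeros(N, 3, requires_grad=True),
)

# Placeholder learning rates (illustrative only, not the values used here).
optimizer = torch.optim.Adam([
    {"params": [gaussians.means],      "lr": 1e-4},  # 3D positions
    {"params": [gaussians.log_scales], "lr": 5e-3},  # anisotropic scales
    {"params": [gaussians.quats],      "lr": 1e-3},  # rotations (quaternions)
    {"params": [gaussians.opacities],  "lr": 5e-2},
    {"params": [gaussians.colours],    "lr": 2e-2},
])
```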

The values for PSNR & SSIM are as follows:
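For reference, a minimal sketch of how these metrics can be computed from a rendered image and its ground truth (using `pytorch_msssim` for SSIM is my assumption; any SSIM implementation works):

```python
import torch
from pytorch_msssim import ssim  # assumed dependency

def psnr(pred, target):
    """PSNR in dB for image tensors scaled to [0, 1]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(1.0 / mse)

# pred, target: (B, 3, H, W) renders and ground-truth images in [0, 1]
# psnr_value = psnr(pred, target)
# ssim_value = ssim(pred, target, data_range=1.0)
```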

Training progress

[GIF]

Final rendered GIF

[GIF]

1.3.1 Rendering Using Spherical Harmonics

[GIF: renders with spherical harmonics]

| Comparison | Without Spherical Harmonics | With Spherical Harmonics | Differences |
|---|---|---|---|
| 1 (frame 0) | [image] | [image] | Lighting appears more realistic with spherical harmonics, showing softer shadows and richer texture detail on the seat. |
| 2 (frame 12) | [image] | [image] | Spherical harmonics add shading variation and highlight depth, making materials look less flat and more natural. |
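For context, a minimal sketch of how the view-dependent colour is evaluated from the SH coefficients, shown here for degree 1 (the (N, 4, 3) coefficient layout is an assumption):

```python
import torch

C0 = 0.28209479177387814  # degree-0 real SH constant
C1 = 0.4886025119029199   # degree-1 real SH constant

def sh_to_rgb(sh_coeffs, dirs):
    """Evaluate degree-1 spherical harmonics per Gaussian.

    sh_coeffs: (N, 4, 3) DC + three linear-band coefficients per colour channel
    dirs:      (N, 3) unit view directions from the camera to each Gaussian
    """
    x, y, z = dirs[:, 0:1], dirs[:, 1:2], dirs[:, 2:3]
    rgb = (C0 * sh_coeffs[:, 0]
           - C1 * y * sh_coeffs[:, 1]
           + C1 * z * sh_coeffs[:, 2]
           - C1 * x * sh_coeffs[:, 3])
    return torch.clamp(rgb + 0.5, 0.0, 1.0)  # shift the DC term and clamp, as in 3DGS
```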

2. Diffusion-guided Optimization

2.1 SDS Loss + Image Optimization

All of the results below were optimized for 2,000 iterations.
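For context, a rough sketch of a single SDS step on image latents; `predict_noise`, `alphas_cumprod`, and `text_emb` are placeholders for a wrapper around the Stable Diffusion UNet, with `predict_noise` assumed to apply classifier-free guidance for the "with guidance" runs and plain conditional prediction otherwise:

```python
import torch

def sds_loss(latents, text_emb, predict_noise, alphas_cumprod):
    """One Score Distillation Sampling step on (B, 4, 64, 64) image latents (sketch)."""
    B = latents.shape[0]
    t = torch.randint(20, 980, (B,), device=latents.device)    # random diffusion timestep
    noise = torch.randn_like(latents)
    a_t = alphas_cumprod[t].view(B, 1, 1, 1)
    noisy = a_t.sqrt() * latents + (1.0 - a_t).sqrt() * noise  # forward-diffuse the latents
    with torch.no_grad():                                      # no gradient through the UNet
        eps = predict_noise(noisy, t, text_emb)
    grad = (1.0 - a_t) * (eps - noise)                         # w(t) * (eps_hat - eps)
    # Surrogate loss whose gradient w.r.t. `latents` is exactly `grad`
    return (grad * latents).sum()
```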

'A hamburger'

| Without Guidance | With Guidance |
|---|---|
| [Frame 0] | [Frame 0] |

'A standing corgi dog'

| Without Guidance | With Guidance |
|---|---|
| [Frame 0] | [Frame 0] |

'a roller coaster'

| Without Guidance | With Guidance |
|---|---|
| [Frame 0] | [Frame 0] |

'f1 car'

| Without Guidance | With Guidance |
|---|---|
| [Frame 0] | [Frame 0] |

2.2 Texture Map Optimization for Mesh

Note: the saved GIFs do not loop continuously. Please refresh the page to restart them.

Prompt: 'Cow with tiger skin'

[GIF]

Prompt: 'Black and white cow'

[GIF]
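A sketch of this kind of texture optimization loop, reusing the `sds_loss` sketch from 2.1 (`render_latents` stands in for the differentiable render-and-encode step; all names are placeholders):

```python
import torch

# Learnable UV texture map; the mesh geometry stays fixed.
texture = torch.nn.Parameter(torch.rand(1, 512, 512, 3))
optimizer = torch.optim.Adam([texture], lr=1e-2)

for it in range(2000):
    # Differentiably render the textured mesh from a random camera
    # and encode the image to latents (placeholder function).
    latents = render_latents(texture)
    loss = sds_loss(latents, text_emb, predict_noise, alphas_cumprod)
    optimizer.zero_grad()
    loss.backward()   # gradients reach only the texture map
    optimizer.step()
```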

2.3 NeRF Optimization

Parameters:

'a standing corgi dog' - 10,000 iterations

| RGB | Depth |
|---|---|
| [GIF] | [GIF] |

'a dinosaur' - 6,000 iterations

| RGB | Depth |
|---|---|
| [GIF] | [GIF] |

'a duck' - 10,000 iterations

| RGB | Depth |
|---|---|
| [GIF] | [GIF] |
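The depth renders use the same volume-rendering weights as the RGB; a minimal sketch of both outputs (tensor names are assumptions):

```python
import torch

def volume_render(sigmas, rgbs, z_vals):
    """Render RGB and expected depth along rays (sketch).

    sigmas: (R, S) densities at S samples along each of R rays
    rgbs:   (R, S, 3) colours at those samples
    z_vals: (R, S) sample depths along each ray
    """
    deltas = z_vals[:, 1:] - z_vals[:, :-1]
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[:, :1])], dim=-1)
    alpha = 1.0 - torch.exp(-sigmas * deltas)                 # per-sample opacity
    T = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                 1.0 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]
    weights = alpha * T                                       # compositing weights
    rgb = (weights.unsqueeze(-1) * rgbs).sum(dim=1)           # (R, 3)
    depth = (weights * z_vals).sum(dim=1)                     # expected depth per ray
    return rgb, depth
```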

Extensions

2.4.1 View-dependent text embedding
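View-dependent conditioning appends a direction phrase to the prompt according to the sampled camera azimuth, so the diffusion prior sees a description consistent with the view; a minimal sketch (the exact angle thresholds are assumptions):

```python
def view_dependent_prompt(prompt, azimuth_deg):
    """Append a view-direction suffix based on camera azimuth (sketch)."""
    a = azimuth_deg % 360
    if a < 45 or a >= 315:
        view = "front view"
    elif 135 <= a < 225:
        view = "back view"
    else:
        view = "side view"
    return f"{prompt}, {view}"

# e.g. view_dependent_prompt("a standing corgi dog", 180) -> "a standing corgi dog, back view"
```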

Parameters:

'a standing corgi dog' - 5,000 iterations

| RGB | Depth |
|---|---|
| [GIF] | [GIF] |

Comparing this result with the corgi generated without view dependence, the main difference is the dog's pose: it is standing in the previous result and sitting here. In addition, the view-dependent model achieves comparable or slightly better structural detail, with the nose now visible, while using only half as many iterations as the non-view-dependent run.

'a dinosaur' - 2,000 iterations

| RGB | Depth |
|---|---|
| [GIF] | [GIF] |

The dinosaur rendered with view dependence shows a more coherent structure and better-defined silhouette, particularly around the head and tail regions. Despite being trained for only 2000 iterations, it already achieves a recognizable shape and some texture surface compared to the non-view-dependent version, which after 6000 iterations still appears flatter and less consistent in geometry.