16-825 Assignment 4¶

Q1: 3D Gaussian Splatting¶


Q1.1.5: Perform Splatting¶

Submission: In your webpage, attach the GIF that you obtained by running render.py
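At its core, the splatting renderer alpha-composites the projected Gaussians front to back. Below is a minimal sketch of that compositing step, assuming the per-pixel colours and opacities have already been evaluated and depth-sorted (tensor names are illustrative, not the starter code's):

```python
import torch

def composite(colours: torch.Tensor, alphas: torch.Tensor) -> torch.Tensor:
    """Front-to-back alpha compositing of depth-sorted Gaussians.

    colours: (N, H, W, 3) per-Gaussian colour contribution at each pixel.
    alphas:  (N, H, W, 1) per-Gaussian opacity at each pixel, sorted
             along dim 0 from nearest to farthest.
    """
    # Transmittance before Gaussian i: T_i = prod_{j < i} (1 - alpha_j)
    trans = torch.cumprod(1.0 - alphas, dim=0)
    trans = torch.cat([torch.ones_like(trans[:1]), trans[:-1]], dim=0)
    weights = alphas * trans                  # contribution of each Gaussian
    return (weights * colours).sum(dim=0)     # (H, W, 3) composited image
```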


Image 1

Q1.2: Training 3D Gaussian Representations¶

Submission: In your webpage, include the following details:

  • Learning rates that you used for each parameter. If you had experimented with multiple sets of learning rates, just mention the set that obtains the best performance in the next question.
  • Number of iterations that you trained the model for.
  • The PSNR and SSIM.
  • Both the GIFs output by train.py.

  1. Learning rates (optimizer setup sketched below):
     • pre_act_opacities: 0.01
     • pre_act_scales: 0.001
     • colours: 0.01
     • means: 0.00001

  2. Number of iterations: 1000

  3. Mean PSNR: 27.746

  4. Mean SSIM: 0.924
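These per-parameter learning rates can be set with Adam parameter groups. A minimal sketch, assuming the trainable tensors live on a hypothetical gaussians object whose attribute names mirror the list above:

```python
import torch

# `gaussians` is a hypothetical container for the trainable tensors;
# the attribute names mirror the list above.
optimizer = torch.optim.Adam([
    {"params": [gaussians.pre_act_opacities], "lr": 0.01},
    {"params": [gaussians.pre_act_scales],    "lr": 0.001},
    {"params": [gaussians.colours],           "lr": 0.01},
    {"params": [gaussians.means],             "lr": 0.00001},
])
```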

q1_training_progress:

Image 1

q1_training_final_renders:

Image 1

Q1.3: Extensions¶

Q1.3.1: Rendering Using Spherical Harmonics¶

Submission: In your webpage, include the following details:

  • Attach the GIF you obtained using render.py for questions 1.3.1 (this question) and 1.1.5 (older question).
  • Attach 2 or 3 side by side RGB image comparisons of the renderings obtained from both the cases. The images that are being compared should correspond to the same view/frame.
  • For each of the side by side comparisons that are attached, provide some explanation of differences (if any) that you notice.

New (question 1.3.1)

Image 1

Old (question 1.1.5)

Image 2

In each view for 1.3.1, the seat material appears more reflective/shiny because its colour shifts with the viewing angle, while for 1.1.5 the colour is static and does not change with the viewpoint.
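The view dependence comes from evaluating a spherical-harmonic basis along the viewing direction instead of storing a single RGB value per Gaussian. A minimal sketch for degree-1 SH using the standard real-SH constants (the (N, 4, 3) coefficient layout is an assumption, not necessarily the starter code's):

```python
import torch

SH_C0 = 0.28209479177387814  # degree-0 (DC) constant
SH_C1 = 0.4886025119029199   # degree-1 constant

def sh_to_colour(sh: torch.Tensor, dirs: torch.Tensor) -> torch.Tensor:
    """Evaluate degree-1 spherical harmonics per Gaussian.

    sh:   (N, 4, 3) coefficients: 1 DC term + 3 degree-1 terms, per RGB channel.
    dirs: (N, 3) unit view directions from the camera to each Gaussian.
    """
    x, y, z = dirs[:, 0:1], dirs[:, 1:2], dirs[:, 2:3]
    colour = (
        SH_C0 * sh[:, 0]
        - SH_C1 * y * sh[:, 1]
        + SH_C1 * z * sh[:, 2]
        - SH_C1 * x * sh[:, 3]
    )
    return torch.clamp(colour + 0.5, min=0.0)  # shift DC, keep colours non-negative
```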

Comparison 1 (1.1.5 on left | 1.3.1 on right)

Image 2

Image 2

Comparison 2 (1.1.5 on left | 1.3.1 on right)

Image 2

Image 2

Comparison 3 (1.1.5 on left | 1.3.1 on right)

Image 2

Image 2

Q2: Diffusion-guided Optimization¶


Q2.1: SDS Loss + Image Optimization¶

Submission: On your webpage, show image output for four different prompts. Use the following two example prompts and two more of your own choice.

For each case (with and without guidance), show the "prompt - image" pair and indicate how many iterations you trained to obtain the results.
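For context, "with guidance" here refers to classifier-free guidance inside the SDS update. A minimal sketch of one SDS step, where unet and scheduler are hypothetical wrappers (unet returns the predicted noise; scheduler.add_noise follows the diffusers convention):

```python
import torch

def sds_loss(latents, text_emb, uncond_emb, unet, scheduler,
             guidance_scale=100.0, use_guidance=True):
    """Surrogate loss whose gradient w.r.t. the latents is the SDS gradient."""
    t = torch.randint(20, 980, (latents.shape[0],), device=latents.device)
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, t)

    with torch.no_grad():
        if use_guidance:
            # Classifier-free guidance: extrapolate from the unconditional
            # prediction toward the text-conditional one.
            eps_cond = unet(noisy, t, text_emb)
            eps_uncond = unet(noisy, t, uncond_emb)
            eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
        else:
            eps = unet(noisy, t, text_emb)

    grad = eps - noise  # SDS gradient (timestep weighting folded into the lr)
    target = (latents - grad).detach()
    return 0.5 * ((latents - target) ** 2).sum()
```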


"a hamburger"¶

Without guidance (1400 iterations)

With guidance (1400 iterations)

"a standing corgi dog"¶

Without guidance (1400 iterations)

With guidance (1400 iterations)

"a spaceship"¶

Without guidance (1400 iterations)

With guidance (1400 iterations)

"avatar the last air bender"¶

Without guidance (1400 iterations)

With guidance (1400 iterations)

Q2.2: Texture Map Optimization for Mesh¶

Submission: On your webpage, show the GIF of the final textured mesh given two different text prompts. You should be able to vary the color and texture pattern using different text prompts.
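The optimization loop is the same SDS recipe as Q2.1, except the optimized variable is a UV texture image rendered through a differentiable mesh renderer. A rough sketch, where sample_random_camera, build_texture, renderer, and encode_to_latents are hypothetical helpers and sds_loss is the sketch from Q2.1:

```python
import torch

# Learnable UV texture image (resolution is an arbitrary choice here).
texture = torch.nn.Parameter(torch.rand(1, 512, 512, 3, device="cuda"))
optimizer = torch.optim.Adam([texture], lr=0.01)

for step in range(2000):
    camera = sample_random_camera()           # random viewpoint each step
    mesh.textures = build_texture(texture)    # attach current texture to the mesh
    image = renderer(mesh, cameras=camera)    # differentiable render, (1, H, W, 3)
    latents = encode_to_latents(image)        # encode to the diffusion latent space
    loss = sds_loss(latents, text_emb, uncond_emb, unet, scheduler)
    optimizer.zero_grad()
    loss.backward()                           # gradients flow back into the texture
    optimizer.step()
```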


"a hamburger" (left) | "rainbow party" (right)

Image 1 Image 3

Q2.3: NeRF Optimization¶

Submission: On your webpage, show the video (.mp4 or .gif) of rendered rgb and depth images for three example prompts, one being "a standing corgi dog" and the other two of your own choice. The rendered object should match the prompt with reasonable geometry and color appearance, i.e. it may not be super photorealistic, but should at least be a clear and recognizable object.

  • Tune the loss weight hyperparameters lambda_entropy and lambda_orient so that you get reasonable results for NeRF optimization. (Hint: try something small such as 1e-3, 1e-2, 1e-1, etc.) The sketch below shows where these weights enter the total loss.
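As a sketch of where the weights enter, assuming hypothetical entropy_loss and orientation_loss helpers (the entropy term pushes ray opacities toward 0/1; the orientation term penalizes normals facing away from the camera) and the sds_loss sketch from Q2.1:

```python
lambda_entropy = 1e-3  # assumed value from the hint's suggested range
lambda_orient = 1e-2   # assumed value from the hint's suggested range

loss = (
    sds_loss(latents, text_emb, uncond_emb, unet, scheduler)
    + lambda_entropy * entropy_loss(weights)                          # opacity regularizer
    + lambda_orient * orientation_loss(normals, view_dirs, weights)   # normal regularizer
)
loss.backward()
```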

prompt: "a standing corgi dog"

Video 1 (RGB) | Video 2 (depth)

prompt: "timmy turner"

Video 1 (RGB) | Video 2 (depth)

prompt: "a car"

Video 1 (RGB) | Video 2 (depth)

prompt: "a hamburger"

Video 1 (RGB) | Video 2 (depth)

Q2.4: Extensions¶

Q2.4.1: View-dependent text embedding¶

Submission: On your webpage, show the video of rendered rgb and depth images for at least two example prompts, one being "a standing corgi dog" and the other one of your own choice. Compare the visual results with what you obtain in Q2.3 and qualitatively analyse the effects of view-dependent text conditioning.


prompt: "a standing corgi dog"

The view-dependent conditioning does make the corgi's appearance change meaningfully across viewpoints, making the fur texture more apparent. However, I also observe that the model sometimes produces artifacts such as a second head. This suggests that view-dependent text conditioning weakens the global consistency of the scene: by giving different textual cues per view, the model seems less able to lock down the spatial placement of semantic parts.
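Mechanically, view-dependent conditioning amounts to appending a view suffix to the prompt based on the sampled camera azimuth (elevation can gate an additional "overhead view" suffix, omitted here). A minimal sketch; the bin edges are an assumption:

```python
def view_dependent_prompt(prompt: str, azimuth_deg: float) -> str:
    """Append a view suffix chosen from the camera azimuth."""
    azimuth_deg = (azimuth_deg + 180.0) % 360.0 - 180.0  # normalize to [-180, 180)
    if -45.0 <= azimuth_deg < 45.0:
        suffix = "front view"
    elif 45.0 <= azimuth_deg < 135.0 or -135.0 <= azimuth_deg < -45.0:
        suffix = "side view"
    else:
        suffix = "back view"
    return f"{prompt}, {suffix}"
```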

Video 1 (RGB) | Video 2 (depth)

prompt: "french fries"

Video 1 (RGB) | Video 2 (depth)