Q1

1.1.5. 3D Gaussian Rasterization: Perform Splatting
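As a rough sketch of what the splatting step computes (this is not the actual starter code; the tensor names and shapes are placeholder assumptions), rasterization amounts to front-to-back alpha compositing of the depth-sorted, projected Gaussians:

```python
import torch

def composite_splats(colors, alphas):
    """Front-to-back alpha compositing of depth-sorted 2D Gaussian splats.

    colors: (N, H, W, 3) per-Gaussian RGB contribution at each pixel
    alphas: (N, H, W, 1) per-Gaussian opacity at each pixel,
            sorted nearest-first along dim 0
    """
    # Transmittance in front of splat i: product of (1 - alpha_j) for j < i.
    transmittance = torch.cumprod(1.0 - alphas, dim=0)
    transmittance = torch.cat(
        [torch.ones_like(transmittance[:1]), transmittance[:-1]], dim=0
    )
    weights = alphas * transmittance           # (N, H, W, 1)
    return (weights * colors).sum(dim=0)       # (H, W, 3)
```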

Here is the rendered result:

1.2.2. Training 3D Gaussian Representations

The learning rates I used for the parameters are the following:

I trained this for 1000 iterations.
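For reference, per-parameter learning rates like these can be wired up with Adam parameter groups. This is only a minimal sketch: the `gaussians.*` attribute names and the rates shown below are illustrative placeholders, not the exact values listed above.

```python
import torch

# Hypothetical attribute names and learning rates for illustration only;
# the actual values used in this run are the ones listed above.
optimizer = torch.optim.Adam([
    {"params": [gaussians.means],     "lr": 1.6e-4},
    {"params": [gaussians.colours],   "lr": 2.5e-3},
    {"params": [gaussians.opacities], "lr": 5.0e-2},
    {"params": [gaussians.scales],    "lr": 5.0e-3},
    {"params": [gaussians.rotations], "lr": 1.0e-3},
])
```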

The resulting mean PSNR is 28.195, and the mean SSIM score is 0.922.
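The PSNR values are computed from the per-image MSE against the ground-truth renders and then averaged. A minimal sketch, assuming images normalized to [0, 1]:

```python
import torch

def psnr(pred, target):
    """PSNR in dB for images with values in [0, 1]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(1.0 / mse)
```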

Here is the training progress of the splats visualized:

Here is the rendered final result after training:

1.3.1. Rendering using Spherical Harmonics

Rendered with spherical harmonics:

Rendered without spherical harmonics (from Q1.1.5 above):

Taking some specific frames, we can see how view dependence affects the rendering:

| View | Without spherical harmonics (Q1.1) | With spherical harmonics (Q1.3) | Observation Notes |
| --- | --- | --- | --- |
| 000 | 000 | 000 | With spherical harmonics, the shadow on the seat transitions subtly and smoothly, instead of appearing dull and discontinuous. |
| 010 | 010 | 010 | Due to view dependence, the seatback and seat material shows a sheen that is absent in the rendering without spherical harmonics. |
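The sheen and smooth shading come from evaluating each Gaussian's spherical-harmonic coefficients along the camera-to-Gaussian viewing direction, so its color changes with the viewpoint. A minimal sketch of the degree-1 evaluation, following the standard real-SH convention (the `(N, 4, 3)` coefficient layout is an assumption about the data structure):

```python
import torch

# Real spherical-harmonic constants for bands 0 and 1.
SH_C0 = 0.28209479177387814
SH_C1 = 0.4886025119029199

def sh_to_rgb(sh, dirs):
    """Evaluate degree-1 real spherical harmonics per Gaussian.

    sh:   (N, 4, 3) SH coefficients (DC term + 3 degree-1 terms, per RGB channel)
    dirs: (N, 3) unit viewing directions from the camera to each Gaussian
    """
    x, y, z = dirs[:, 0:1], dirs[:, 1:2], dirs[:, 2:3]
    rgb = (SH_C0 * sh[:, 0]
           - SH_C1 * y * sh[:, 1]
           + SH_C1 * z * sh[:, 2]
           - SH_C1 * x * sh[:, 3])
    # Shift so an all-zero degree-1 band gives mid-grey rather than black.
    return torch.clamp(rgb + 0.5, 0.0, 1.0)
```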

Q2

2.1. Diffusion-guided Image Optimization

All diffusion results were obtained with 2000 iterations of training.

| Prompt | Result Without Guidance | Result With Guidance |
| --- | --- | --- |
| a hamburger | img | img |
| a standing corgi dog | img | img |
| a humanoid robot | img | img |
| a sports car | img | img |

It is worth noting that every result without guidance on the SDS loss collapses into a mostly uniform white-beige scene within a few hundred steps, while using guidance always produces non-empty results. Each prompt was trained five times, with the same outcome every time.
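The difference is the classifier-free-guidance weighting inside the SDS gradient (DreamFusion uses a large guidance scale, around 100). A minimal sketch of one SDS step in the DreamFusion formulation, assuming a diffusers-style API; `unet`, `scheduler`, and the embedding tensors are stand-ins for the actual Stable Diffusion components:

```python
import torch

def sds_loss(latents, unet, scheduler, text_emb, uncond_emb,
             guidance_scale=100.0):
    """One Score Distillation Sampling step with classifier-free guidance."""
    B = latents.shape[0]
    t = torch.randint(20, 980, (B,), device=latents.device)
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, t)

    with torch.no_grad():
        # Predict noise with and without the text condition in one batch.
        latent_in = torch.cat([noisy, noisy], dim=0)
        emb = torch.cat([uncond_emb, text_emb], dim=0)
        eps = unet(latent_in, torch.cat([t, t]),
                   encoder_hidden_states=emb).sample
        eps_uncond, eps_text = eps.chunk(2)
        # Classifier-free guidance: push the conditional prediction
        # away from the unconditional one.
        eps = eps_uncond + guidance_scale * (eps_text - eps_uncond)

    w = 1.0 - scheduler.alphas_cumprod.to(latents.device)[t]
    grad = w.view(-1, 1, 1, 1) * (eps - noise)
    # Surrogate loss whose gradient w.r.t. the latents equals `grad`,
    # skipping backpropagation through the UNet as in the paper.
    return (grad.detach() * latents).sum()
```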

2.2. Texture Map Optimization for Mesh

Prompt: "A hamburger"

Prompt: "A standing corgi dog"

Prompt: "A pizza slice"

2.3. NeRF Optimization

| Prompt | Resulting depth map | Resulting rendered RGB |
| --- | --- | --- |
| a standing corgi dog | | |
| a hamburger | | |
| a potted plant | | |
| a sports car | | |

Note that the sports car does not converge during training. This illustrates that a view-independent text embedding optimizes every rendered view toward the same target, without accounting for the viewing angle.

2.4. Bonus: View-Dependent Text Embedding

| Prompt | Resulting depth map | Resulting rendered RGB |
| --- | --- | --- |
| a standing corgi dog | | |
| a sports car | | |

Using the view-dependent text embedding provided in the utils Python script, the standing corgi dog prompt actually performs worse: training fails to converge. The sports car prompt, however, performs better than in Q2.3, converging to a mostly accurate car shape with one of the front pillars missing.

This shows that this primitive form of view dependence in the text embedding is not robust, though it did help the sports car prompt converge. Currently, the code only distinguishes "front", "side", and "back" directions, while the paper also mentions "overhead" and "underneath" views. Expanding the set of embeddings as in the DreamFusion paper may significantly improve the output render quality.
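Concretely, this kind of view dependence is implemented by appending a direction phrase to the prompt based on the camera pose and encoding each variant separately. A sketch of how the expanded set could look; the function name and angle thresholds are illustrative assumptions, not the actual utils code:

```python
def view_dependent_prompt(prompt, elevation, azimuth):
    """Pick a direction suffix from the camera angles (in degrees).

    Extends the front/side/back set used here with the overhead and
    underneath views mentioned above; the thresholds are illustrative.
    """
    if elevation > 60:
        return f"{prompt}, overhead view"
    if elevation < -60:
        return f"{prompt}, underneath view"
    azimuth = azimuth % 360
    if azimuth < 45 or azimuth >= 315:
        return f"{prompt}, front view"
    if 135 <= azimuth < 225:
        return f"{prompt}, back view"
    return f"{prompt}, side view"
```

Each suffixed prompt would then be encoded once with the text encoder, and the matching embedding selected per rendered view during training.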