3D Gaussian Splatting and Diffusion Guided Optimization¶

Part 1 - 3D Gaussian Splatting¶

1.1.2 Evaluate 2D Gaussians¶

Result of running the unit tests for compute_cov_3D, compute_cov_2D, compute_means_2D, and evaluate_gaussian_2D:

[Image: unit test results]
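For context, here is a minimal sketch of what the 2D evaluation step computes once the means and covariances have been projected to screen space. The function name and tensor shapes are assumptions for illustration, not the starter code's exact signatures:

```python
import torch

def evaluate_gaussian_2d(points_2d, means_2d, cov_2d):
    """
    Evaluate (unnormalized) 2D Gaussian densities.

    points_2d: (P, 2) pixel coordinates
    means_2d:  (N, 2) projected Gaussian centres
    cov_2d:    (N, 2, 2) projected 2D covariances
    returns:   (N, P) density of each Gaussian at each pixel
    """
    diff = points_2d[None, :, :] - means_2d[:, None, :]          # (N, P, 2)
    cov_inv = torch.linalg.inv(cov_2d)                           # (N, 2, 2)
    # Mahalanobis distance d^T Sigma^{-1} d for every (gaussian, pixel) pair
    maha = torch.einsum('npi,nij,npj->np', diff, cov_inv, diff)  # (N, P)
    return torch.exp(-0.5 * maha)
```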

1.1.5 Perform Splatting¶

Output of running render.py:

[Image: output of render.py]
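The splatting itself comes down to depth-ordered alpha compositing of the per-pixel Gaussian contributions. Below is a minimal sketch under assumed tensor shapes; it is not the starter code's exact API:

```python
import torch

def composite_front_to_back(colours, alphas):
    """
    Front-to-back alpha compositing of depth-sorted Gaussians.

    colours: (N, P, 3) colour of each Gaussian at each pixel (sorted near -> far)
    alphas:  (N, P)    opacity * 2D Gaussian density at each pixel
    returns: (P, 3)    composited pixel colours
    """
    # Transmittance before each Gaussian: product of (1 - alpha) of all closer ones
    transmittance = torch.cumprod(
        torch.cat([torch.ones_like(alphas[:1]), 1.0 - alphas[:-1]], dim=0), dim=0
    )                                                   # (N, P)
    weights = alphas * transmittance                    # (N, P)
    return (weights[..., None] * colours).sum(dim=0)    # (P, 3)
```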

1.2.2 Perform Forward Pass and Compute Loss¶

The learning rates I used for the parameters were as follows:

  • gaussians.pre_act_opacities --> 0.005
  • gaussians.pre_act_scales --> 0.005
  • gaussians.colours --> 0.05
  • gaussians.means --> 0.0002

I chose these based on the intuition that the colours should be the least sensitive to small perturbations, so a higher learning rate is appropriate there. The means are the most sensitive, since any change moves the Gaussians through space, which is why they get the smallest learning rate. The pre_act_opacities and pre_act_scales affect the visibility and size of the Gaussians and should be moderately sensitive, so I chose a learning rate between the other two.
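A minimal sketch of how these rates could be wired up as per-parameter groups (the choice of Adam is an assumption; the attribute names match the list above):

```python
import torch

# `gaussians` is the trainable Gaussian model from the assignment
optimizer = torch.optim.Adam([
    {"params": [gaussians.pre_act_opacities], "lr": 0.005},
    {"params": [gaussians.pre_act_scales],    "lr": 0.005},
    {"params": [gaussians.colours],           "lr": 0.05},
    {"params": [gaussians.means],             "lr": 0.0002},
])
```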

Trained for: 1000 iterations

Results

  • Mean PSNR: 27.844
  • Mean SSIM: 0.919

Output Results:

[Images: rendered views after training]

Extensions - 1.3.1 Rendering Using Spherical Harmonics¶

[Images: four rows of side-by-side RGB renders, View Independent (Previous) on the left and View Dependent (w/ Spherical Harmonics) on the right]

Observations:

  • Comparing the side-by-side RGB renders from frames 1, 13, and 18, we can notice some key differences:

  • Frame 1 (2nd Row) --> The contrast in the seat of the chair is much more apparent in the view-dependent rendering. This is true both of the contrast between the green and gold parts and of the contrast between the shaded and non-shaded regions; the non-shadowed green looks much brighter and better lit in the view-dependent rendering. This is because spherical harmonics allow view-dependent appearance effects such as shading variation, which leads to more photorealistic results like this.

  • Frame 13 (3rd Row) --> The gold embroidery on the seat of the chair and the metal pieces on the arms both look shinier in the view-dependent rendering than in the view-independent one. There is a "glint" to them, enabled by the spherical harmonics capturing specular reflections and highlights, whereas the lighting averaging inherent in the view-independent rendering suppresses such effects.

  • Frame 18 (4th Row) --> Once again, the glint of the gold on the lower part of the seat (near the backrest) is more prominent in the view-dependent rendering. The whole green/gold portion in the view-independent rendering looks very flat and does not capture the material realism well.

All these observations motivate why the added complexity and parameters needed to store spherical harmonic coefficients may be worth the trade-off, as they lead to renderings that are more realistic and visually pleasing.
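To make the view dependence concrete, here is a minimal sketch of how degree-1 spherical harmonic coefficients could be evaluated into per-Gaussian colours given a viewing direction. The function name and coefficient layout are my own; the basis constants and the +0.5 offset follow the convention used in common 3D Gaussian Splatting implementations:

```python
import torch

# Real spherical harmonic basis constants for bands 0 and 1
SH_C0 = 0.28209479177387814
SH_C1 = 0.4886025119029199

def sh_to_colour(sh_coeffs, view_dirs):
    """
    sh_coeffs: (N, 4, 3) SH coefficients per Gaussian (band 0 + band 1, RGB)
    view_dirs: (N, 3)    unit vectors from the camera to each Gaussian
    returns:   (N, 3)    view-dependent RGB colours
    """
    x, y, z = view_dirs[:, 0:1], view_dirs[:, 1:2], view_dirs[:, 2:3]
    colour = (
        SH_C0 * sh_coeffs[:, 0]            # constant (view-independent) term
        - SH_C1 * y * sh_coeffs[:, 1]      # band-1 terms add view dependence
        + SH_C1 * z * sh_coeffs[:, 2]
        - SH_C1 * x * sh_coeffs[:, 3]
    )
    return torch.clamp(colour + 0.5, 0.0, 1.0)
```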

Part 2 - Diffusion-guided Optimization¶

2.1 SDS Loss + Image Optimization¶
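As a reminder of what "guidance" refers to in the comparisons below, here is a minimal sketch of one SDS gradient step with classifier-free guidance. The UNet call follows a diffusers-style signature and the variable names are assumptions rather than the actual starter-code API; the guidance scale of 100 is a common choice for SDS.

```python
import torch

def sds_grad(unet, latents, text_emb, uncond_emb, alphas_cumprod, guidance_scale=100.0):
    """
    One SDS step: add noise to the latents at a random timestep, predict it
    with the frozen diffusion UNet, and use (predicted - true) noise as the
    gradient pushed back into the latents.
    """
    t = torch.randint(20, 980, (1,), device=latents.device)
    noise = torch.randn_like(latents)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy = a_t.sqrt() * latents + (1.0 - a_t).sqrt() * noise

    with torch.no_grad():
        eps_uncond = unet(noisy, t, encoder_hidden_states=uncond_emb).sample
        eps_text = unet(noisy, t, encoder_hidden_states=text_emb).sample

    # Classifier-free guidance: the "With Guidance" results use a large scale
    eps = eps_uncond + guidance_scale * (eps_text - eps_uncond)

    w = 1.0 - a_t                 # common timestep weighting
    return w * (eps - noise)      # gradient w.r.t. the latents
```

In practice this tensor is applied with latents.backward(gradient=...) (or an equivalent surrogate loss), so the image or latent parameters are updated as if it were the gradient of a true loss.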

Results of different image optimizations:

"a hamburger"¶

[Images: Without Guidance (1000 iterations) vs. With Guidance (2000 iterations)]

"a ferocious dragon"¶

[Images: Without Guidance (1000 iterations) vs. With Guidance (1100 iterations)]

"a pistachio croissant"¶

[Images: Without Guidance (1000 iterations) vs. With Guidance (2000 iterations)]

"a standing corgi dog"¶

[Images: Without Guidance (1000 iterations) vs. With Guidance (2000 iterations)]

2.2 Texture Map Optimization for Mesh¶

"a cow wearing a green sweater"¶

[Images: Initial Mesh vs. Final Mesh (2000 iterations)]

"a cow with a plaid outfit"¶

[Images: Initial Mesh vs. Final Mesh (2000 iterations)]

2.3 NeRF Optimization¶

"a standing corgi dog "¶

[Image and videos: RGB image (6000 iterations), rendered depth video, rendered RGB video]

"a fire breathing dragon"¶

[Image and videos: RGB image (6000 iterations), rendered depth video, rendered RGB video]

"a orange race car"¶

[Image and videos: RGB image (6000 iterations), rendered depth video, rendered RGB video]

One of the renderings that did not do well:¶

I suspect this is because "formula one" is a very specific part of the prompt that may not be well represented in the training data.

Therefore, we get a messy rendering with blobs of red. Simply changing the prompt to "a orange race car" resulted in a much better output, as seen above.

"a red formula one car"¶

[Image and videos: RGB image (6000 iterations), rendered depth video, rendered RGB video]

Extensions - 2.4.1 View-Dependent Text Embedding¶

To extend to view-dependent text embeddings, I changed the call to prepare_embeddings so that it also returns text embeddings for the front, back, and side views (along with the usual uncond and default embeddings). During the training loop, I added one SDS loss computation per text embedding (front, back, side, default) and then aggregated these losses before backpropagating. Below, after a short sketch of that change, we can see some side-by-side comparisons:
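This is a minimal sketch of the aggregation just described; the sds_loss signature and the embedding dictionary keys are assumptions based on my description rather than the exact starter-code API.

```python
def view_dependent_sds_loss(sds, latents, embeddings):
    """
    Aggregate SDS losses over view-specific prompt embeddings.

    sds:        object assumed to expose sds_loss(latents, text_embeddings, uncond_text_embeddings)
    latents:    encoded render of the current view
    embeddings: dict with keys "front", "back", "side", "default", "uncond"
    """
    view_keys = ["front", "back", "side", "default"]
    losses = [
        sds.sds_loss(latents,
                     text_embeddings=embeddings[key],
                     uncond_text_embeddings=embeddings["uncond"])
        for key in view_keys
    ]
    # Simple mean over the per-view losses (a plain sum works too, up to LR scaling)
    return sum(losses) / len(losses)
```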

"a standing corgi dog"¶

[Videos: View Independent (left) vs. View Dependent (right) renders, two rows]

We can see that the view-dependent method has a positive effect on the consistency of the rendering in the following ways:

  • In the left column (view independent), the dog's face is very inconsistent: its eyes and nose flicker in and out across views in a way that breaks the realism
  • The legs of the corgi also appear to change position between views in the view-independent rendering, further showing the limitations of this method
  • In the view-dependent rendering, the body is more consistent and we don't get the same flickering with the face (even though the eyes aren't present at all, the snout stays in a consistent position)
  • The legs and the tail of the corgi in the view-dependent rendering are also more locked into position
  • Interestingly, we get the three-ears artifact in both renderings

"a sleeping lion"¶

[Videos: View Independent (left) vs. View Dependent (right) renders, two rows]

  • In this example, the view-dependent rendering leads to a much better overall result
  • The lion object (although still a little vague) is much more recognizable than in the view-independent rendering, and even though finer details are missing, you can see where the general head and body of the animal are
  • Furthermore, there is much less noise than in the view-independent rendering, which looks as though it sits in a hazy cloud. This is despite both renderings having been optimized for 100,000 iterations.