3D Gaussian Splatting and Diffusion Guided Optimization¶

Part 1 - 3D Gaussian Splatting¶

1.1.2 Evaluate 2D Gaussians¶

Result of running the unit tests for compute_cov_3D, compute_cov_2D, compute_means_2D, and evaluate_gaussian_2D:

[Image: unit test results]
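For context, here is a minimal sketch of what the 2D evaluation step computes once the means and covariances have been projected to screen space. The function name and tensor shapes are assumptions for illustration, not the starter code's exact signatures:

```python
import torch

def evaluate_gaussian_2d(points_2d, means_2d, cov_2d):
    """
    Evaluate (unnormalized) 2D Gaussian densities.

    points_2d: (P, 2) pixel coordinates
    means_2d:  (N, 2) projected Gaussian centres
    cov_2d:    (N, 2, 2) projected 2D covariances
    returns:   (N, P) density of each Gaussian at each pixel
    """
    diff = points_2d[None, :, :] - means_2d[:, None, :]          # (N, P, 2)
    cov_inv = torch.linalg.inv(cov_2d)                           # (N, 2, 2)
    # Mahalanobis distance d^T Sigma^{-1} d for every (gaussian, pixel) pair
    maha = torch.einsum('npi,nij,npj->np', diff, cov_inv, diff)  # (N, P)
    return torch.exp(-0.5 * maha)
```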

1.1.5 Perform Splatting¶

Output of running render.py:

[Image: output of render.py]
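The splatting itself comes down to depth-ordered alpha compositing of the per-pixel Gaussian contributions. Below is a minimal sketch under assumed tensor shapes; it is not the starter code's exact API:

```python
import torch

def composite_front_to_back(colours, alphas):
    """
    Front-to-back alpha compositing of depth-sorted Gaussians.

    colours: (N, P, 3) colour of each Gaussian at each pixel (sorted near -> far)
    alphas:  (N, P)    opacity * 2D Gaussian density at each pixel
    returns: (P, 3)    composited pixel colours
    """
    # Transmittance before each Gaussian: product of (1 - alpha) of all closer ones
    transmittance = torch.cumprod(
        torch.cat([torch.ones_like(alphas[:1]), 1.0 - alphas[:-1]], dim=0), dim=0
    )                                                   # (N, P)
    weights = alphas * transmittance                    # (N, P)
    return (weights[..., None] * colours).sum(dim=0)    # (P, 3)
```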

1.2.2 Perform Forward Pass and Compute Loss¶

The learning rates I used for the parameters were as follows:

  • gaussians.pre_act_opacities --> 0.005
  • gaussians.pre_act_scales --> 0.005
  • gaussians.colours --> 0.05
  • gaussians.means --> 0.0002

I chose these based on the intuition that the colours should be the least sensitive to small perturbations, so a higher learning rate is appropriate there. The means are the most sensitive, since any change moves the Gaussians through space, which is why they get the smallest learning rate. The pre_act_opacities and pre_act_scales affect the visibility and size of the Gaussians and should be moderately sensitive, so I chose a learning rate between the other two.
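A minimal sketch of how these rates could be wired up as per-parameter groups (the choice of Adam is an assumption; the attribute names match the list above):

```python
import torch

# `gaussians` is the trainable Gaussian model from the assignment
optimizer = torch.optim.Adam([
    {"params": [gaussians.pre_act_opacities], "lr": 0.005},
    {"params": [gaussians.pre_act_scales],    "lr": 0.005},
    {"params": [gaussians.colours],           "lr": 0.05},
    {"params": [gaussians.means],             "lr": 0.0002},
])
```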

Trained for: 1000 iterations

Results

  • Mean PSNR: 27.844
  • Mean SSIM: 0.919

Output Results:

[Images: rendered views after training]

Extensions - 1.3.1 Rendering Using Spherical Harmonics¶

[Images: four rows of side-by-side RGB renders, View Independent (Previous) on the left and View Dependent (w/ Spherical Harmonics) on the right]

Observations:

  • Comparing the side-by-side RGB renders from frames 1, 13, and 18, we can notice some key differences:

  • Frame 1 (2nd Row) --> The contrast in the seat of the chair is much more apparent in the view-dependent rendering. This is true both of the contrast between the green and gold parts and of the contrast between the shaded and non-shaded regions; the non-shadowed green looks much brighter and better lit in the view-dependent rendering. This is because spherical harmonics allow view-dependent appearance effects such as shading variation, which leads to more photorealistic results like this.

  • Frame 13 (3rd Row) --> The gold embroidery on the seat of the chair and the metal pieces on the arms both look shinier in the view-dependent rendering than in the view-independent one. There is a "glint" to them, enabled by the spherical harmonics capturing specular reflections and highlights, whereas the lighting averaging inherent in the view-independent rendering suppresses such effects.

  • Frame 18 (4th Row) --> Once again, the glint of the gold on the lower part of the seat (near the backrest) is more prominent in the view-dependent rendering. The whole green/gold portion in the view-independent rendering looks very flat and does not capture the material realism well.

All these observations motivate why the added complexity and parameters needed to store spherical harmonic coefficients may be worth the trade-off, as they lead to renderings that are more realistic and visually pleasing.
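To make the view dependence concrete, here is a minimal sketch of how degree-1 spherical harmonic coefficients could be evaluated into per-Gaussian colours given a viewing direction. The function name and coefficient layout are my own; the basis constants and the +0.5 offset follow the convention used in common 3D Gaussian Splatting implementations:

```python
import torch

# Real spherical harmonic basis constants for bands 0 and 1
SH_C0 = 0.28209479177387814
SH_C1 = 0.4886025119029199

def sh_to_colour(sh_coeffs, view_dirs):
    """
    sh_coeffs: (N, 4, 3) SH coefficients per Gaussian (band 0 + band 1, RGB)
    view_dirs: (N, 3)    unit vectors from the camera to each Gaussian
    returns:   (N, 3)    view-dependent RGB colours
    """
    x, y, z = view_dirs[:, 0:1], view_dirs[:, 1:2], view_dirs[:, 2:3]
    colour = (
        SH_C0 * sh_coeffs[:, 0]            # constant (view-independent) term
        - SH_C1 * y * sh_coeffs[:, 1]      # band-1 terms add view dependence
        + SH_C1 * z * sh_coeffs[:, 2]
        - SH_C1 * x * sh_coeffs[:, 3]
    )
    return torch.clamp(colour + 0.5, 0.0, 1.0)
```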

Part 2 - Diffusion-guided Optimization¶

2.1 SDS Loss + Image Optimization¶
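As a reminder of what "guidance" refers to in the comparisons below, here is a minimal sketch of one SDS gradient step with classifier-free guidance. The UNet call follows a diffusers-style signature and the variable names are assumptions rather than the actual starter-code API; the guidance scale of 100 is a common choice for SDS.

```python
import torch

def sds_grad(unet, latents, text_emb, uncond_emb, alphas_cumprod, guidance_scale=100.0):
    """
    One SDS step: add noise to the latents at a random timestep, predict it
    with the frozen diffusion UNet, and use (predicted - true) noise as the
    gradient pushed back into the latents.
    """
    t = torch.randint(20, 980, (1,), device=latents.device)
    noise = torch.randn_like(latents)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy = a_t.sqrt() * latents + (1.0 - a_t).sqrt() * noise

    with torch.no_grad():
        eps_uncond = unet(noisy, t, encoder_hidden_states=uncond_emb).sample
        eps_text = unet(noisy, t, encoder_hidden_states=text_emb).sample

    # Classifier-free guidance: the "With Guidance" results use a large scale
    eps = eps_uncond + guidance_scale * (eps_text - eps_uncond)

    w = 1.0 - a_t                 # common timestep weighting
    return w * (eps - noise)      # gradient w.r.t. the latents
```

In practice this tensor is applied with latents.backward(gradient=...) (or an equivalent surrogate loss), so the image or latent parameters are updated as if it were the gradient of a true loss.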

Results of different image optimizations:

"a hamburger"¶

[Images: Without Guidance (1000 iterations) vs. With Guidance (2000 iterations)]

"a ferocious dragon"¶

[Images: Without Guidance (1000 iterations) vs. With Guidance (1100 iterations)]

"a pistachio croissant"¶

[Images: Without Guidance (1000 iterations) vs. With Guidance (2000 iterations)]

"a standing corgi dog"¶

[Images: Without Guidance (1000 iterations) vs. With Guidance (2000 iterations)]

2.2 Texture Map Optimization for Mesh¶

"a cow wearing a green sweater"¶

[Images: Initial Mesh vs. Final Mesh (2000 iterations)]

"a cow with a plaid outfit"¶

[Images: Initial Mesh vs. Final Mesh (2000 iterations)]

2.3 NeRF Optimization¶

"a standing corgi dog "¶

[Image and videos: RGB image (6000 iterations), rendered depth video, rendered RGB video]

"a fire breathing dragon"¶

[Image and videos: RGB image (6000 iterations), rendered depth video, rendered RGB video]

"a orange race car"¶

[Image and videos: RGB image (6000 iterations), rendered depth video, rendered RGB video]

One of the renderings that did not do well:¶

I suspect this is because "formula one" is a very specific part of the prompt that may not be well represented in the training data.

Therefore, we get a messy rendering with blobs of red. Simply changing the prompt to "a orange race car" resulted in a much better output, as seen above.

"a red formula one car"¶

[Image and videos: RGB image (6000 iterations), rendered depth video, rendered RGB video]

Extensions - 2.4.1 View-Dependent Text Embedding¶

To extend to view-dependent text embeddings, I changed the call to prepare_embeddings so that it also returns text embeddings for the front, back, and side views (along with the usual uncond and default embeddings). During the training loop, I added one SDS loss computation per text embedding (front, back, side, default) and then aggregated these losses before backpropagating. Below, after a short sketch of that change, we can see some side-by-side comparisons:
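This is a minimal sketch of the aggregation just described; the sds_loss signature and the embedding dictionary keys are assumptions based on my description rather than the exact starter-code API.

```python
def view_dependent_sds_loss(sds, latents, embeddings):
    """
    Aggregate SDS losses over view-specific prompt embeddings.

    sds:        object assumed to expose sds_loss(latents, text_embeddings, uncond_text_embeddings)
    latents:    encoded render of the current view
    embeddings: dict with keys "front", "back", "side", "default", "uncond"
    """
    view_keys = ["front", "back", "side", "default"]
    losses = [
        sds.sds_loss(latents,
                     text_embeddings=embeddings[key],
                     uncond_text_embeddings=embeddings["uncond"])
        for key in view_keys
    ]
    # Simple mean over the per-view losses (a plain sum works too, up to LR scaling)
    return sum(losses) / len(losses)
```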

"a standing corgi dog"¶

[Videos: View Independent (left) vs. View Dependent (right) renders, two rows]

We can see that the view-dependent method has a positive effect on the consistency of the rendering in the following ways:

  • In the left column (view independent), the dog's face is very inconsistent: its eyes and nose flicker in and out across views in a way that breaks the realism
  • The legs of the corgi also appear to change position between views in the view-independent rendering, further showing the limitations of this method
  • In the view-dependent rendering, the body is more consistent and we don't get the same flickering with the face (even though the eyes aren't present at all, the snout stays in a consistent position)
  • The legs and the tail of the corgi in the view-dependent rendering are also more locked into position
  • Interestingly, we get the three-ears artifact in both renderings

"a sleeping lion"¶

[Videos: View Independent (left) vs. View Dependent (right) renders, two rows]

  • In this example, the view-dependent rendering leads to a much better overall result
  • The lion object (although still a little vague) is much more recognizable than in the view-independent rendering, and even though finer details are missing, you can see where the general head and body of the animal are
  • Furthermore, there is much less noise than in the view-independent rendering, which looks as though it sits in a hazy cloud. This is despite both renderings having been optimized for 100,000 iterations.