Rendered output from the 3D Gaussian rasterizer:

The learning rates used for training the 3D Gaussian representation were:

pre_act_opacities: 0.05
pre_act_scales: 0.05
colours: 0.05
means: 0.005

The model was trained for 200 iterations, achieving the following metrics:


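As an aside, here is a minimal sketch of how such per-parameter learning rates can be wired into a PyTorch optimizer. The tensor shapes and attribute names are assumptions mirroring the list above, not the actual training code.

```python
import torch

# Hypothetical parameter tensors for N Gaussians; the names mirror the
# learning-rate list above, and the shapes are illustrative assumptions.
N = 1000
params = {
    "pre_act_opacities": torch.zeros(N, 1, requires_grad=True),
    "pre_act_scales":    torch.zeros(N, 3, requires_grad=True),
    "colours":           torch.zeros(N, 3, requires_grad=True),
    "means":             torch.zeros(N, 3, requires_grad=True),
}

# One optimizer parameter group per attribute, so each attribute
# gets its own learning rate.
optimizer = torch.optim.Adam([
    {"params": [params["pre_act_opacities"]], "lr": 0.05},
    {"params": [params["pre_act_scales"]],    "lr": 0.05},
    {"params": [params["colours"]],           "lr": 0.05},
    {"params": [params["means"]],             "lr": 0.005},
])
```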
Rendered output without Spherical Harmonics (same as Section 1.1):

Rendered output using Spherical Harmonics:

View 1:
Without Spherical Harmonics:

With Spherical Harmonics:
The chair's pattern appears more detailed, and the shadows and reflections on the metallic ornaments look more realistic.
View 2:
Without Spherical Harmonics:

With Spherical Harmonics:
Reflections are noticeably improved. Without Spherical Harmonics, metallic pieces display flat colors, and the fabric pattern lacks variation in response to shadows and light. Incorporating Spherical Harmonics introduces view-dependent effects, enhancing realism.
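To make "view-dependent" concrete, below is a minimal sketch of degree-1 spherical harmonic colour evaluation, following the convention used by common 3D Gaussian splatting implementations. The tensor shapes and function name are assumptions for illustration; real implementations typically use SH coefficients up to degree 3.

```python
import torch

C0 = 0.28209479177387814  # constant basis Y_0^0
C1 = 0.4886025119029199   # magnitude of the degree-1 bases Y_1^m

def sh_to_colour(sh_coeffs: torch.Tensor, dirs: torch.Tensor) -> torch.Tensor:
    """sh_coeffs: (N, 4, 3), one DC term plus three degree-1 terms per RGB
    channel; dirs: (N, 3), unit viewing directions from camera to Gaussian."""
    x, y, z = dirs[:, 0:1], dirs[:, 1:2], dirs[:, 2:3]
    colour = (
        C0 * sh_coeffs[:, 0]
        - C1 * y * sh_coeffs[:, 1]
        + C1 * z * sh_coeffs[:, 2]
        - C1 * x * sh_coeffs[:, 3]
    )
    # Shift so the DC term is centred around 0.5, then clamp to valid RGB.
    return (colour + 0.5).clamp(0.0, 1.0)
```

Because the degree-1 terms depend on the viewing direction, the same Gaussian can take on different colours from different cameras, which is what produces the improved reflections described above.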
Prompt: "a hamburger"
Left: Without SDS guidance. Right: With SDS guidance.
Iterations: 400 and 1000, respectively.

Prompt: "a standing corgi dog"
Left: Without SDS guidance. Right: With SDS guidance.
Iterations: 1000 and 900, respectively.

Prompt: "a soccer ball"
Left: Without SDS guidance. Right: With SDS guidance.
Iterations: 1500 and 1100, respectively.

Prompt: "a rubik cube"
Left: Without SDS guidance. Right: With SDS guidance.
Iterations: 1400 and 1999, respectively.
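For reference, the SDS objective underlying these comparisons can be sketched as follows. This is the standard DreamFusion-style formulation with classifier-free guidance, written against a diffusers-style UNet and scheduler; the argument names and the guidance scale are assumptions, not the report's exact code.

```python
import torch
import torch.nn.functional as F

def sds_loss(latents, unet, scheduler, text_emb, uncond_emb, cfg_scale=100.0):
    # Sample a random diffusion timestep and noise the latents.
    t = torch.randint(20, 980, (latents.shape[0],), device=latents.device)
    noise = torch.randn_like(latents)
    noisy_latents = scheduler.add_noise(latents, noise, t)

    with torch.no_grad():
        # Classifier-free guidance: blend conditional and unconditional scores.
        eps_cond = unet(noisy_latents, t, encoder_hidden_states=text_emb).sample
        eps_uncond = unet(noisy_latents, t, encoder_hidden_states=uncond_emb).sample
        eps = eps_uncond + cfg_scale * (eps_cond - eps_uncond)

    # The SDS gradient is (eps - noise); folding it into an MSE against a
    # detached target reproduces that gradient through autograd.
    target = (latents - (eps - noise)).detach()
    return 0.5 * F.mse_loss(latents, target, reduction="sum")
```

If "without guidance" in the captions refers to running SDS without classifier-free guidance, it corresponds to cfg_scale = 1.0 (the unconditional pass can then be skipped).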

Prompt: "a hamburger"

Prompt: "a rubik cube"

Prompt: "a standing corgi dog"

Prompt: "a hotdog"

Prompt: "a gamer chair"

Prompt: "a standing corgi dog"

Prompt: "a hotdog"

The use of view-dependent text embedding significantly improves the results. In the corgi example, without view-dependent text embedding, the dog appears to have three ears, and its face remains visible from nearly every angle. Incorporating view-dependency makes the geometry more realistic and consistent with the expected appearance.
A similar improvement is observed in the hotdog example. Without view-dependency, the result lacks asymmetry, which is a natural characteristic of a hotdog. Adding view-dependency introduces the expected asymmetry, leading to a more realistic and visually accurate representation.
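View-dependent text embedding is typically implemented by augmenting the prompt with the camera's viewpoint before encoding it, as in DreamFusion. A minimal sketch, where the azimuth thresholds are assumed values:

```python
def view_dependent_prompt(base_prompt: str, azimuth_deg: float) -> str:
    """Append a view suffix to the prompt based on camera azimuth.
    Thresholds are illustrative; implementations vary."""
    azimuth = azimuth_deg % 360.0
    if azimuth < 45.0 or azimuth >= 315.0:
        suffix = "front view"
    elif azimuth < 135.0:
        suffix = "side view"
    elif azimuth < 225.0:
        suffix = "back view"
    else:
        suffix = "side view"
    return f"{base_prompt}, {suffix}"

# e.g. view_dependent_prompt("a standing corgi dog", 180.0)
# -> "a standing corgi dog, back view"
```

Telling the diffusion model which side it is generating discourages it from painting a face onto every view, which is why the three-eared, always-frontal corgi disappears.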
For this implementation, the code from Section 2.3 was extended with a new loss function, sds_loss_pixel, which computes the loss directly between the predicted and target images rather than between latents. Working in pixel space makes it possible to combine LPIPS loss with MSE loss, so the objective captures both perceptual similarity and pixel-wise accuracy. A sketch of this combination follows.
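The core loss combination inside such a function might look like the sketch below, using the lpips package; the weighting factor, function name, and tensor conventions are assumptions, not the report's actual values.

```python
import torch
import torch.nn.functional as F
import lpips  # perceptual similarity metric: pip install lpips

lpips_fn = lpips.LPIPS(net="vgg")  # LPIPS with a VGG backbone

def sds_loss_pixel_sketch(pred_rgb: torch.Tensor, target_rgb: torch.Tensor,
                          lpips_weight: float = 0.1) -> torch.Tensor:
    """pred_rgb: rendered image; target_rgb: decoded one-step denoised image,
    detached so gradients only flow through the render. Both are (B, 3, H, W)
    in [-1, 1], the range LPIPS expects."""
    mse = F.mse_loss(pred_rgb, target_rgb)
    perceptual = lpips_fn(pred_rgb, target_rgb).mean()
    return mse + lpips_weight * perceptual
```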
The results are worse than those in Section 2.3, especially for depth estimation. This could be due to the decoder not reconstructing the image faithfully, leading to noisy gradients and suboptimal loss values.