For the previous questions, the unit tests passed.

Learning rates:
- pre_act_opacities lr: 2e-4
- pre_act_scales lr: 2e-4
- colours lr: 1e-3
- means lr: 5e-4

Number of iterations: 1000
[*] Evaluation --- Mean PSNR: 28.407
[*] Evaluation --- Mean SSIM: 0.924
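
For reference, here is a minimal sketch of how these per-parameter learning rates could be wired up with an Adam optimizer in PyTorch. The parameter tensors below are hypothetical stand-ins for the Gaussian attributes named above, not the assignment's actual code.

```python
import torch

# Hypothetical stand-ins for the Gaussian attributes listed above (N = 10k points).
pre_act_opacities = torch.nn.Parameter(torch.zeros(10_000))
pre_act_scales    = torch.nn.Parameter(torch.zeros(10_000, 3))
colours           = torch.nn.Parameter(torch.zeros(10_000, 3))
means             = torch.nn.Parameter(torch.zeros(10_000, 3))

# One Adam parameter group per attribute, with the learning rates reported above.
optimizer = torch.optim.Adam([
    {"params": [pre_act_opacities], "lr": 2e-4},
    {"params": [pre_act_scales],    "lr": 2e-4},
    {"params": [colours],           "lr": 1e-3},
    {"params": [means],             "lr": 5e-4},
])
```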

| Frame Number | Default | Spherical Harmonics | Explanations |
|---|---|---|---|
| All | ![]() | ![]() | Previously, the rendered color was view independent. When we extend to higher-order SH terms, the colors vary smoothly with viewpoint and capture speculars, reflections, and shadows more naturally. Overall, the SH render produces much richer shading and more realistic surface color changes (a sketch of the SH color evaluation follows this table). |
| 0 | ![]() | ![]() | In the default case the backrest looks duller throughout. With SH, the left part of the backrest is lighter and the right part is darker, giving a much more natural shadowed look. Similarly, the shadows at the base look far more natural due to the view-dependent variation in the SH version, whereas in the default version it looks as if a mask of constant opacity were applied. The motif on the base is also much more vibrant with SH. |
| 10 | ![]() | ![]() | The SH version shows richer tonal variation. The inner seat back and armrest have more accurate shadowing and highlights, and the fabric texture looks slightly more pronounced, probably due to the view-dependent brightness variation. |
| 20 | ![]() | ![]() | From the back angle both versions look quite similar. The only noticeable difference is that SH captures more color variation in the top part of the chair (the brown area), with some yellow/gold spots; the non-SH version fails to capture these details. |
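
As mentioned in the table above, the view-dependent colors come from evaluating spherical harmonics along the camera-to-Gaussian direction. Below is a minimal sketch of that evaluation, assuming degree-1 SH with the standard real SH constants; the coefficient layout and helper name are illustrative, not the assignment's exact code.

```python
import torch

SH_C0 = 0.28209479177387814  # degree-0 (constant / DC) band
SH_C1 = 0.4886025119029199   # degree-1 band

def sh_to_rgb(sh_coeffs: torch.Tensor, view_dirs: torch.Tensor) -> torch.Tensor:
    """Evaluate degree-1 SH colors per Gaussian (illustrative sketch).

    sh_coeffs: (N, 4, 3) -- one DC term plus three degree-1 terms, per RGB channel.
    view_dirs: (N, 3)    -- unit vectors from the camera to each Gaussian center.
    """
    x, y, z = view_dirs[:, 0:1], view_dirs[:, 1:2], view_dirs[:, 2:3]
    rgb = SH_C0 * sh_coeffs[:, 0]                     # view-independent base color
    rgb = rgb - SH_C1 * y * sh_coeffs[:, 1] \
              + SH_C1 * z * sh_coeffs[:, 2] \
              - SH_C1 * x * sh_coeffs[:, 3]           # view-dependent correction
    return torch.clamp(rgb + 0.5, 0.0, 1.0)           # 0.5 offset, as in common 3DGS code
```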
| Setting | Training Progress | Final | Mean PSNR | Mean SSIM |
|---|---|---|---|---|
| Default Settings | ![]() | ![]() | 16.242 | 0.585 |
| Modified V1 | ![]() | ![]() | 18.002 | 0.695 |
| Modified Final | ![]() | ![]() | 18.515 | 0.721 |
The Default Settings baseline uses isotropic Gaussians, an L1 loss, and the learning rates from the previous question. As expected, the results were quite poor. To improve on this, I used anisotropic Gaussians, different learning rates, an LR scheduler, an added SSIM loss, and a different number of iterations; I also tried different weights for the L1 and SSIM losses. With weights of 0.8 and 0.1 for the L1 and SSIM losses respectively, performance was still poor, with metrics close to the Default case:
[*] Evaluation --- Mean PSNR: 15.794
[*] Evaluation --- Mean SSIM: 0.575
With the SSIM weight at 0.1, I also reduced the learning rates for each parameter (the commented-out progression can be seen in the code) and ran training for 3000 epochs; the results are marked as Modified V1. Here we at least start seeing rough shapes of the balls, so the result looked much better, and the metrics improved as well. Finally, I trained for 8000 epochs. This shows a lot more improvement over both the baseline and Modified V1: a few balls that were not visible previously are now clearly visible, and many of the specular-highlight-like artifacts seen in V1 are gone. To improve further, I would tune the parameters more and try running for a somewhat higher number of epochs.
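
A minimal sketch of the weighted L1 + SSIM objective described above, with the 0.8 / 0.1 weights as defaults. `ssim_fn` is assumed to be any differentiable SSIM implementation (e.g. from an external package) and is not part of the code shown in this report.

```python
import torch
import torch.nn.functional as F

def combined_loss(rendered: torch.Tensor, target: torch.Tensor, ssim_fn,
                  w_l1: float = 0.8, w_ssim: float = 0.1) -> torch.Tensor:
    """Weighted L1 + (1 - SSIM) loss; weights default to the values reported above.

    `ssim_fn(rendered, target)` is assumed to return a differentiable scalar SSIM
    in [0, 1]; since higher SSIM is better, (1 - SSIM) is used as the loss term.
    """
    l1 = F.l1_loss(rendered, target)
    d_ssim = 1.0 - ssim_fn(rendered, target)
    return w_l1 * l1 + w_ssim * d_ssim
```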
| Prompt | Number of Iterations (w/o guidance / w/ guidance) | Without Guidance | With Guidance |
|---|---|---|---|
| "a hamburger" | 400 / 2000 | ![]() | ![]() |
| "a standing corgi dog" | 400 / 2000 | ![]() | ![]() |
| "a weightlifting cat" | 400 / 2000 | ![]() | ![]() |
| "a racoon dj" | 1000 / 2000 | ![]() | ![]() |
| Prompt | Mesh |
|---|---|
| "a cow" | ![]() |
| "a black and white zebra" | ![]() |
| Prompt | Depth | RGB |
|---|---|---|
| "a standing corgi" | ![]() |
![]() |
| "a robot" | ![]() |
![]() |
| "a sitting cat" | ![]() |
![]() |
| Prompt | Non-View Dependent | View Dependent |
|---|---|---|
| "a standing corgi" | ![]() |
![]() ![]() |
| "a robot" | ![]() ![]() |
![]() |
Using view-dependent text gives a much more accurate 3D representation overall. This is likely because, when the 2D images are generated conditioned on a view angle, the model gets to learn that some parts of the object are only visible from certain angles, which helps it recover better shapes for the individual components of an object. For the standing corgi, without view-dependent text the model learns a representation with three ears, which is not very accurate; with view dependence, the model knows that a third ear would not be visible from any view angle, so the representation does not contain that extra artifact. Similarly, the dog's snout has a more accurate shape than without view dependence. For the robot, without view dependence the robot has three legs and one arm. Adding view-dependent text gives the robot much clearer arms, attached on different sides from the legs, and adds an antenna to the robot's head. As with the dog's snout, the robot's legs also have a better shape, since the model is not trying to average over all views.
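
A minimal sketch of how view-dependent text could be constructed from the sampled camera azimuth; the angle thresholds and phrasing here are illustrative assumptions, not the exact values used in the code.

```python
def view_dependent_prompt(base_prompt: str, azimuth_deg: float) -> str:
    """Append a view phrase to the prompt based on the camera azimuth (illustrative)."""
    a = azimuth_deg % 360.0
    if a < 45.0 or a >= 315.0:
        view = "front view"
    elif 135.0 <= a < 225.0:
        view = "back view"
    else:
        view = "side view"
    return f"{base_prompt}, {view}"

# e.g. view_dependent_prompt("a standing corgi", 180.0) -> "a standing corgi, back view"
```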
To compute the loss in pixel space, I still use the U-Net to predict denoised latents, but I decode the prediction to pixel space with the VAE decoder. The NeRF render is then compared against this decoded prediction with a simple L2 loss. More specifically, in the pixel version the NeRF image is passed through the Stable Diffusion encoder to get a latent representation; after noise is added to this latent, the diffusion model predicts its denoised version; this predicted latent is decoded back into an RGB image using the decoder; and the loss is computed as the pixel-wise L2 difference between the NeRF-rendered image and the decoded prediction. Gradients affect only the NeRF-rendered image; the diffusion model's output is treated as a constant.
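
Below is a sketch of one such pixel-space step, assuming a diffusers-style Stable Diffusion setup (`vae`, `unet`, `scheduler`) and the usual 0.18215 latent scaling; details such as classifier-free guidance are omitted, and the exact attribute names may differ from the assignment's wrapper.

```python
import torch

def pixel_space_sds_step(nerf_rgb, vae, unet, scheduler, text_embeddings, t):
    """One pixel-space loss step as described above (sketch).

    nerf_rgb: (B, 3, H, W) NeRF render in [0, 1], requires grad.
    t:        scalar diffusion timestep (int or 0-d LongTensor).
    """
    # 1. Encode the NeRF render into the Stable Diffusion latent space.
    latents = vae.encode(nerf_rgb * 2.0 - 1.0).latent_dist.sample() * 0.18215

    # 2. Add noise at timestep t and predict the noise with the frozen U-Net.
    noise = torch.randn_like(latents)
    noisy_latents = scheduler.add_noise(latents, noise, t)
    with torch.no_grad():
        noise_pred = unet(noisy_latents, t,
                          encoder_hidden_states=text_embeddings).sample

        # 3. Recover the predicted clean latent and decode it back to pixels.
        alpha_bar = scheduler.alphas_cumprod[t]
        pred_x0 = (noisy_latents - (1 - alpha_bar).sqrt() * noise_pred) / alpha_bar.sqrt()
        target_rgb = (vae.decode(pred_x0 / 0.18215).sample * 0.5 + 0.5).clamp(0, 1)

    # 4. Pixel-wise L2 between the NeRF render and the (detached) decoded prediction,
    #    so gradients flow only into the NeRF render.
    return 0.5 * ((nerf_rgb - target_rgb.detach()) ** 2).mean()
```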
| Prompt | Standard SDS Loss | Pixel Loss |
|---|---|---|
| "a robot" | ![]() ![]() |
![]() |
The SDS loss in pixel space makes the optimization better aligned with human perception, so the results look a lot more realistic. The pixel loss tends to produce more visually consistent color, since the loss is applied after decoding, which helps enforce low-level color similarity; there is, however, some noisiness in the geometry and some blurriness. The pixel-space loss also takes roughly 30% longer per iteration because of the additional decoding step. In contrast, with the standard SDS loss the overall geometry of the robot is captured, but the result lacks realism in terms of color and shape.
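
For reference, the standard SDS gradient (as introduced in DreamFusion) that the pixel-space variant replaces can be written as:

$$
\nabla_\theta \mathcal{L}_{\text{SDS}} \approx \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,\big(\hat{\epsilon}_\phi(z_t;\, y, t) - \epsilon\big)\,\frac{\partial z}{\partial \theta} \right]
$$

where $z$ is the latent of the rendered image, $z_t$ its noised version at timestep $t$, $\epsilon$ the injected noise, $\hat{\epsilon}_\phi$ the U-Net's noise prediction for prompt $y$, and $w(t)$ a timestep weighting. The pixel-space variant instead backpropagates an L2 computed in decoded image space against the detached decoded prediction.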