The unit tests passed for all previous questions.

Q 1.1.2

alt text

Q 1.2.2

Learning rates:
    pre_act_opacities lr: 2e-4
    pre_act_scales lr: 2e-4
    colours lr: 1e-3
    means lr: 5e-4


Number of iterations: 1000
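
A minimal sketch of how these per-parameter learning rates could be set up with Adam parameter groups; the stand-in tensors and their shapes are assumptions, with only the parameter names and learning rates taken from the configuration above:

```python
import torch

# Stand-in Gaussian parameters (zero/random tensors here); in the assignment these
# come from the initialized point cloud.
N = 10_000
gaussian_params = {
    "pre_act_opacities": torch.nn.Parameter(torch.zeros(N)),
    "pre_act_scales": torch.nn.Parameter(torch.zeros(N, 3)),
    "colours": torch.nn.Parameter(torch.rand(N, 3)),
    "means": torch.nn.Parameter(torch.randn(N, 3)),
}

# One Adam parameter group per tensor, with the learning rates listed above.
lrs = {
    "pre_act_opacities": 2e-4,
    "pre_act_scales": 2e-4,
    "colours": 1e-3,
    "means": 5e-4,
}
optimizer = torch.optim.Adam(
    [{"params": [param], "lr": lrs[name]} for name, param in gaussian_params.items()]
)
```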

[*] Evaluation --- Mean PSNR: 28.407
[*] Evaluation --- Mean SSIM: 0.924

alt text alt text

Q 1.3.1

| Frame Number | Default | Spherical Harmonics | Explanations |
| --- | --- | --- | --- |
| All | alt text | alt text | Previously, the rendering was view-independent. When we extend to higher-order SH terms, the colors vary smoothly across viewpoints and capture specular reflections and shading more naturally (see the sketch after this table). Overall, the SH render produces much richer shading and more realistic surface color changes. |
| 0 | alt text | alt text | The backrest is duller throughout in the default case. With SH, the left part of the backrest is lighter and the right is darker, giving a much more natural shadowed look. Similarly, the shadows at the base look a lot more natural due to the variation in the SH version; in the default version it looks like a mask of constant opacity. Moreover, the motif on the base is much more vibrant with SH. |
| 10 | alt text | alt text | The SH version has richer tonal variation. The inner seat back and armrest show more accurate shadowing and highlights. The texture of the fabric looks slightly more pronounced, probably due to the view-dependent brightness variation. |
| 20 | alt text | alt text | From the back angle, both look fairly similar. The only noticeable difference is that SH captures more color in the top part of the chair (the brown area), with some yellow/gold spots. The non-SH version fails to capture such details. |
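
For context on where the view dependence comes from: the degree-0 coefficient alone gives a constant color, while the degree-1 terms modulate it with the viewing direction. Below is a minimal sketch of a degree-1 SH color evaluation, following the constants used in common 3DGS implementations; it is illustrative, not the exact rendering code used here.

```python
import torch

# Real-valued SH basis constants for degrees 0 and 1.
SH_C0 = 0.28209479177387814
SH_C1 = 0.4886025119029199

def sh_to_colour(sh_coeffs: torch.Tensor, view_dirs: torch.Tensor) -> torch.Tensor:
    """Evaluate per-Gaussian color from degree-1 SH coefficients.

    sh_coeffs: (N, 4, 3) - 4 SH coefficients per color channel.
    view_dirs: (N, 3)    - unit vectors from the camera to each Gaussian.
    """
    x, y, z = view_dirs[:, 0:1], view_dirs[:, 1:2], view_dirs[:, 2:3]
    colour = (
        SH_C0 * sh_coeffs[:, 0]        # degree 0: view-independent base color
        - SH_C1 * y * sh_coeffs[:, 1]  # degree 1 terms add the smooth
        + SH_C1 * z * sh_coeffs[:, 2]  # view-dependent variation
        - SH_C1 * x * sh_coeffs[:, 3]
    )
    # Offset and clamp as in common 3DGS implementations.
    return torch.clamp(colour + 0.5, 0.0, 1.0)
```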

Q 1.3.2

| Setting | Training Progress | Final | Mean PSNR | Mean SSIM |
| --- | --- | --- | --- | --- |
| Default Settings | alt text | alt text | 16.242 | 0.585 |
| Modified V1 | alt text | alt text | 18.002 | 0.695 |
| Modified Final | alt text | alt text | 18.515 | 0.721 |

The Default Settings baseline uses isotropic Gaussians, an L1 loss, and the learning rates from the previous question. As expected, the results were quite poor. To improve on this, I used anisotropic Gaussians, different learning rates, an LR scheduler, an added SSIM loss, and a different number of iterations. I also tried different weights for the L1 and SSIM losses. With weights of 0.8 and 0.1 for the L1 and SSIM losses respectively, performance was still poor, with metrics close to the default case:

[*] Evaluation --- Mean PSNR: 15.794
[*] Evaluation --- Mean SSIM: 0.575

I reduced the weighting of the SSIM loss to 0.1, reduced the learning rates for each parameter (you can see the commented-out progression in the code), and ran training for 3000 epochs. The results are marked as Modified V1. Here we at least start seeing rough shapes of the balls, so the result looked much better, and the metrics improved as well. Finally, I trained for 8000 epochs (Modified Final). Here we see a lot more improvement over both the baseline and Modified V1: a few balls that were not visible previously are now clearly visible, and a lot of the specular-highlight-like artifacts seen in V1 are gone. To improve further, I would tune the parameters more and run for a somewhat higher number of epochs.
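
A minimal sketch of the weighted L1 + SSIM objective described above; the `pytorch_msssim` dependency and the function signature are assumptions, with the weights left as arguments:

```python
import torch
from pytorch_msssim import ssim  # assumption: any differentiable SSIM implementation works here

def reconstruction_loss(pred: torch.Tensor, gt: torch.Tensor,
                        l1_weight: float = 0.8, ssim_weight: float = 0.1) -> torch.Tensor:
    """Weighted L1 + SSIM loss between rendered and ground-truth images.

    pred, gt: (B, 3, H, W) tensors in [0, 1].
    """
    l1 = torch.abs(pred - gt).mean()
    # SSIM is a similarity measure, so use (1 - SSIM) as the loss term.
    ssim_term = 1.0 - ssim(pred, gt, data_range=1.0)
    return l1_weight * l1 + ssim_weight * ssim_term
```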

Q 2.1

| Prompt | Number of Iterations (w/o guidance / w/ guidance) | Without Guidance | With Guidance |
| --- | --- | --- | --- |
| "a hamburger" | 400 / 2000 | alt text | alt text |
| "a standing corgi dog" | 400 / 2000 | alt text | alt text |
| "a weightlifting cat" | 400 / 2000 | alt text | alt text |
| "a racoon dj" | 1000 / 2000 | alt text | alt text |

Q 2.2

| Prompt | Mesh |
| --- | --- |
| "a cow" | alt text |
| "a black and white zebra" | alt text |

Q 2.3

| Prompt | Depth | RGB |
| --- | --- | --- |
| "a standing corgi" | alt text | alt text |
| "a robot" | alt text | alt text |
| "a sitting cat" | alt text | alt text |

Q 2.4.1

| Prompt | Non-View-Dependent | View-Dependent |
| --- | --- | --- |
| "a standing corgi" | alt text alt text | alt text alt text |
| "a robot" | alt text alt text | alt text alt text |

Using view-dependent text gives a much more accurate 3D representation overall. This is likely because, when 2D images are generated conditioned on a view angle, the model gets to learn that some parts of an object are only visible from certain angles, which helps it recover better shapes for individual components. For the standing corgi, without view-dependent text the model learns a representation with three ears, which is not accurate. With view dependency, the model knows that a third ear should not be visible from any view angle, so the representation does not contain this extra artifact. Similarly with the dog's snout: it has a more accurate shape than without view dependency. For the robot, without view dependency the robot has three legs and one arm. Adding view-dependent text gives the robot much clearer arms, attached on different sides from the legs, and adds an antenna to the robot's head. As with the dog's snout, the robot's legs also have a better shape, since the model is not trying to average over all views.
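
As a rough illustration, view-dependent conditioning can be implemented by appending a view suffix to the prompt based on the sampled camera pose; the following is a minimal sketch with illustrative angle thresholds, not the exact logic used in the assignment:

```python
def view_dependent_prompt(base_prompt: str, azimuth_deg: float, elevation_deg: float) -> str:
    """Append a view suffix to the text prompt based on the sampled camera pose.

    azimuth_deg is assumed to lie in [-180, 180), with 0 facing the front of the object.
    """
    if elevation_deg > 60.0:
        suffix = "overhead view"
    elif abs(azimuth_deg) < 45.0:
        suffix = "front view"
    elif abs(azimuth_deg) > 135.0:
        suffix = "back view"
    else:
        suffix = "side view"
    return f"{base_prompt}, {suffix}"

# Example: view_dependent_prompt("a standing corgi", azimuth_deg=170.0, elevation_deg=30.0)
# -> "a standing corgi, back view"
```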

Q 2.4.3

Implementation Details

To compute the loss in pixel space, I still use the U-Net to predict the denoised latents, but I decode them to pixel space using the decoder. The NeRF-rendered image is then compared against the decoded prediction with a simple L2 loss. More specifically, in the pixel version, the NeRF image is passed through the SD encoder to get a latent representation. After adding noise to this latent, the diffusion model predicts the denoised version of it. This predicted latent is then decoded back into an RGB image using the decoder. The loss is then computed as the pixel-wise L2 difference between the NeRF-rendered image and the decoded prediction. Only the NeRF-rendered image is affected by the gradients; the diffusion model output remains untouched.
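
A minimal sketch of this pixel-space variant, assuming the Stable Diffusion VAE and U-Net from `diffusers`, a DDPM-style scheduler, and a one-step estimate of the clean latent from the predicted noise; the helper signature and constants are assumptions, not the exact code used, and classifier-free guidance is omitted for brevity:

```python
import torch
import torch.nn.functional as F

def pixel_space_sds_loss(nerf_rgb, vae, unet, scheduler, text_embeddings):
    """Pixel-space SDS-style loss.

    nerf_rgb: (B, 3, H, W) NeRF render in [0, 1]; the only tensor that receives gradients.
    """
    with torch.no_grad():
        # 1. Encode the NeRF render into latent space (SD latent scaling factor 0.18215).
        latents = vae.encode(nerf_rgb * 2.0 - 1.0).latent_dist.sample() * 0.18215

        # 2. Add noise at a random timestep.
        t = torch.randint(20, 980, (latents.shape[0],), device=latents.device)
        noise = torch.randn_like(latents)
        noisy_latents = scheduler.add_noise(latents, noise, t)

        # 3. Predict the noise and recover a one-step estimate of the clean latent.
        noise_pred = unet(noisy_latents, t, encoder_hidden_states=text_embeddings).sample
        alpha_bar = scheduler.alphas_cumprod.to(latents.device)[t].view(-1, 1, 1, 1)
        pred_latents = (noisy_latents - (1 - alpha_bar).sqrt() * noise_pred) / alpha_bar.sqrt()

        # 4. Decode the predicted latent back to RGB; this acts as a fixed target.
        target_rgb = (vae.decode(pred_latents / 0.18215).sample * 0.5 + 0.5).clamp(0, 1)

    # 5. Pixel-wise L2 loss; gradients flow only into the NeRF render.
    return F.mse_loss(nerf_rgb, target_rgb)
```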

| Prompt | Standard SDS Loss | Pixel Loss |
| --- | --- | --- |
| "a robot" | alt text alt text | alt text alt text |

The SDS loss in pixel space makes the optimization better aligned with human perception, and the results look a lot more realistic. The pixel loss tends to produce more visually consistent color, since the loss is applied after decoding, which helps enforce low-level color similarity. There is, however, some noisiness in the geometry and some blurriness. The pixel-space loss also takes noticeably longer per iteration, roughly 30% more, due to the additional decoding step. In contrast, the standard SDS loss captures the overall geometry of the robot but lacks realism in terms of color and shape.