
Learning rates:
- pre_act_opacities: 0.01
- pre_act_scales: 0.001
- colours: 0.01
- means: 0.0005

Number of iterations: 1000
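The per-parameter learning rates above can be expressed as optimizer parameter groups. The following is a minimal sketch, not the actual training code: the helper `build_param_groups` is a hypothetical name, and the parameters are assumed to be available in a dictionary keyed by the names used in this report.

```python
# Learning rates as reported above; the keys mirror the names in the report.
LEARNING_RATES = {
    "pre_act_opacities": 0.01,
    "pre_act_scales": 0.001,
    "colours": 0.01,
    "means": 0.0005,
}


def build_param_groups(params, learning_rates):
    """Return one optimizer group per named parameter (hypothetical helper).

    `params` maps a parameter name to a tensor-like object. The returned list
    uses the [{"params": [...], "lr": ...}] layout that PyTorch optimizers
    accept as their first argument.
    """
    missing = set(learning_rates) - set(params)
    if missing:
        raise KeyError(f"no parameter supplied for: {sorted(missing)}")
    return [
        {"params": [params[name]], "lr": lr}
        for name, lr in learning_rates.items()
    ]
```

With PyTorch, the result could be passed directly to an optimizer, for example `torch.optim.Adam(build_param_groups(gaussian_params, LEARNING_RATES))`. The choice of Adam and the name `gaussian_params` are assumptions for illustration.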
PSNR: 29.191
SSIM: 0.939
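To put the PSNR figure in context, the sketch below converts between PSNR and mean squared error. It assumes pixel values are normalised to [0, 1] (so the peak value is 1.0); the report does not state the value range, and the function names are illustrative.

```python
import math


def psnr_from_mse(mse, max_val=1.0):
    """PSNR in dB for a given mean squared error and peak pixel value."""
    if mse <= 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_from_psnr(psnr_db, max_val=1.0):
    """Invert the PSNR formula to recover the mean squared error."""
    return max_val ** 2 / (10.0 ** (psnr_db / 10.0))


# Assumed [0, 1] pixel range: what per-pixel error does 29.191 dB imply?
mse = mse_from_psnr(29.191)
rmse = math.sqrt(mse)
```

Under that assumption, a PSNR of 29.191 dB corresponds to an MSE of roughly 0.0012, or a root-mean-square error of about 0.035 per pixel value.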
Final renders:

Training progress renders:

| 0th-order only | Full |
|---|---|
| ![]() | ![]() |
| ![]() | ![]() |
| ![]() | ![]() |
First frame comparison: the texture details are much sharper in the full SH version.
Second frame comparison: the 0th-order-only version shows the same lighting as in the previous frame, whereas the SH version now looks darker. The SH version captures the lighting correctly: the previous frame shows that the light travels from right to left, so once the chair faces left, its visible surface is no longer directly lit and appears darker.
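The difference between the two versions can be illustrated with a small spherical harmonics evaluation for a single colour channel. This is a sketch under stated assumptions, not the renderer used here: it evaluates only degrees 0 and 1, follows one common sign convention for the real SH basis, and the coefficient values are made up for the demonstration.

```python
import math

C0 = 0.28209479177387814  # real SH basis constant for degree 0
C1 = 0.4886025119029199   # real SH basis constant for degree 1


def sh_colour(coeffs, direction, max_degree=1):
    """Evaluate one colour channel from SH coefficients for a view direction.

    coeffs: [dc, c1, c2, c3] for a single channel (degree 0 plus degree 1).
    direction: (x, y, z) viewing direction; it is normalised here.
    """
    x, y, z = direction
    norm = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / norm, y / norm, z / norm

    # Degree 0 is constant over all directions.
    value = C0 * coeffs[0]
    if max_degree >= 1:
        # Degree 1 terms vary linearly with the viewing direction.
        value += -C1 * y * coeffs[1] + C1 * z * coeffs[2] - C1 * x * coeffs[3]

    # Offset by 0.5 and clamp to [0, 1] (an assumed output convention).
    return min(max(value + 0.5, 0.0), 1.0)


coeffs = [0.5, 0.0, 0.0, 0.6]  # illustrative values only
view_a, view_b = (1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)

flat_a = sh_colour(coeffs, view_a, max_degree=0)
flat_b = sh_colour(coeffs, view_b, max_degree=0)
full_a = sh_colour(coeffs, view_a, max_degree=1)
full_b = sh_colour(coeffs, view_b, max_degree=1)
```

With `max_degree=0` the colour is identical from both directions, which matches the unchanged lighting in the 0th-order-only renders. With `max_degree=1` the two directions yield different brightness.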
All results are optimized for 2000 iterations.
| Prompt | Without guidance | With guidance |
|---|---|---|
| "a hamburger" | ![]() | ![]() |
| "a standing corgi dog" | ![]() | ![]() |
| "a spherical rubik's cube" | ![]() | ![]() |
| "a pikachu holding a gun" | ![]() | ![]() |
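Assuming "guidance" in the table above refers to classifier-free guidance (the report does not spell this out), the combination of the two noise predictions can be sketched as follows. Plain Python lists stand in for tensors, and `apply_guidance` is a hypothetical name.

```python
def apply_guidance(noise_uncond, noise_cond, guidance_scale):
    """Classifier-free guidance on two noise predictions.

    Extrapolates from the unconditional prediction toward the
    text-conditioned one. A scale of 1.0 returns the conditional
    prediction unchanged; larger values push further along the
    text-conditioned direction.
    """
    return [
        u + guidance_scale * (c - u)
        for u, c in zip(noise_uncond, noise_cond)
    ]


# Toy two-element "predictions" for illustration.
uncond = [0.0, 1.0]
cond = [1.0, 3.0]
unguided = apply_guidance(uncond, cond, 1.0)
guided = apply_guidance(uncond, cond, 2.0)
```

In this sketch, "without guidance" is taken to mean a scale of 1.0, which is one possible reading. The scale actually used for these results is not given here.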
Prompt: "a cow with orange skin and blue dots"

Prompt: "a dotted black and white cow"

| Prompt | RGB | Depth |
|---|---|---|
| "a standing corgi dog" | ![]() | ![]() |
| "a pikachu holding a sword" | ![]() | ![]() |
| "an f1 racing car" | ![]() | ![]() |
| Prompt | RGB, without VD | RGB, with VD | Depth, without VD | Depth, with VD |
|---|---|---|---|---|
| "a standing corgi dog" | ![]() | ![]() | ![]() | ![]() |
| "a pikachu holding a sword" | ![]() | ![]() | ![]() | ![]() |
| "an f1 racing car" | ![]() | ![]() | ![]() | ![]() |
Visual results comparison:
In summary, view-dependent text conditioning helps improve the correctness of the generated 3D objects. Without it, the model is biased toward generating front-facing features from every viewpoint, because front-facing views make up a large proportion of the training set.
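One way to implement view-dependent text conditioning is to append a view description to the prompt based on the sampled camera angles. The sketch below is illustrative only: the function name, the suffix wording, and the angle thresholds are assumptions rather than the settings used for the results above.

```python
def view_dependent_prompt(prompt, azimuth_deg, elevation_deg=0.0):
    """Append a view suffix chosen from the camera angles.

    Azimuth 0 is assumed to face the front of the object. All thresholds
    are illustrative, not taken from the report.
    """
    if elevation_deg > 60.0:
        return f"{prompt}, overhead view"

    azimuth = azimuth_deg % 360.0  # wrap negative or large angles into [0, 360)
    if azimuth < 45.0 or azimuth >= 315.0:
        suffix = "front view"
    elif 135.0 <= azimuth < 225.0:
        suffix = "back view"
    else:
        suffix = "side view"
    return f"{prompt}, {suffix}"


front = view_dependent_prompt("a standing corgi dog", 0.0)
back = view_dependent_prompt("a standing corgi dog", 180.0)
```

Each rendered view is then scored against the prompt that matches its camera pose, so the back of the object is no longer pushed toward front-facing features.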