Assignment 4: Neural Volume Rendering and Surface Rendering¶

Name: Xinyu Liu¶

1. 3D Gaussian Splatting¶

1.1 3D Gaussian Rasterization¶

[GIF: 3D Gaussian rasterization result]

1.2 Training 3D Gaussian Representations¶

Experimental settings:

  1. Learning rates that you used for each parameter:
    parameters = [
         {'params': [gaussians.pre_act_opacities], 'lr': 1e-3, "name": "opacities"},
         {'params': [gaussians.pre_act_scales], 'lr': 1e-4, "name": "scales"},
         {'params': [gaussians.colours], 'lr': 5e-4, "name": "colours"},
         {'params': [gaussians.means], 'lr': 1e-5, "name": "means"},
     ]
    
  2. Number of iterations: 5000
  3. PSNR and SSIM: mean PSNR = 27.582; mean SSIM = 0.924.
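As a concrete sketch of this setup, the per-parameter learning rates above map directly onto PyTorch optimizer parameter groups. The `Gaussians` container below is a minimal stand-in for the actual representation (the real class holds the rasterizer's pre-activation tensors), and the loss is a placeholder for the rendering loss:

```python
import torch

class Gaussians:
    # Placeholder container; attribute names follow the parameter groups above.
    def __init__(self, n=1000):
        self.pre_act_opacities = torch.zeros(n, requires_grad=True)
        self.pre_act_scales = torch.zeros(n, 3, requires_grad=True)
        self.colours = torch.rand(n, 3, requires_grad=True)
        self.means = torch.randn(n, 3, requires_grad=True)

gaussians = Gaussians()

# Per-parameter learning rates, matching the groups listed above.
parameters = [
    {'params': [gaussians.pre_act_opacities], 'lr': 1e-3, "name": "opacities"},
    {'params': [gaussians.pre_act_scales],    'lr': 1e-4, "name": "scales"},
    {'params': [gaussians.colours],           'lr': 5e-4, "name": "colours"},
    {'params': [gaussians.means],             'lr': 1e-5, "name": "means"},
]
optimizer = torch.optim.Adam(parameters)

for step in range(5000):
    optimizer.zero_grad()
    # In the real loop this would be an image-space loss against the
    # rasterized rendering; a trivial stand-in loss is used here.
    loss = (gaussians.colours ** 2).mean()
    loss.backward()
    optimizer.step()
```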

[GIFs: rendered training results]

1.3 Extensions¶

1.3.1 Rendering Using Spherical Harmonics¶

Overall, without spherical harmonics the color of each Gaussian is constant regardless of the camera or view direction, resulting in a low-contrast, unrealistic appearance. With spherical harmonics, highlights and shading vary as the camera moves, producing higher-contrast colors and a more realistic appearance.
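A minimal sketch of how view-dependent color can be evaluated from degree-1 spherical harmonics; the coefficient layout `(N, 4, 3)` and the mid-grey shift are assumptions following common splatting conventions, while the basis constants are the standard degree-0/1 SH values:

```python
import torch

# Standard real SH basis constants for degrees 0 and 1.
C0 = 0.28209479177387814
C1 = 0.4886025119029199

def sh_to_rgb(sh_coeffs, view_dirs):
    """Evaluate degree-1 spherical harmonics per Gaussian.

    sh_coeffs: (N, 4, 3) - four SH coefficients per color channel (assumed layout).
    view_dirs: (N, 3)    - unit vectors from the camera to each Gaussian.
    """
    x, y, z = view_dirs[:, 0:1], view_dirs[:, 1:2], view_dirs[:, 2:3]
    rgb = (C0 * sh_coeffs[:, 0]
           - C1 * y * sh_coeffs[:, 1]
           + C1 * z * sh_coeffs[:, 2]
           - C1 * x * sh_coeffs[:, 3])
    # Shift so zero coefficients give mid-grey, then clamp (common convention).
    return (rgb + 0.5).clamp(0.0, 1.0)

dirs = torch.nn.functional.normalize(torch.randn(8, 3), dim=-1)
coeffs = torch.zeros(8, 4, 3)
colors = sh_to_rgb(coeffs, dirs)  # zero coefficients -> mid-grey for all dirs
```

With all-zero coefficients the color is view-independent; non-zero degree-1 coefficients make it vary with the view direction, which is what produces the moving highlights described above.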

For the first camera pose, the highlight should appear only on the left outer side of the seat, while the right outer side is blocked by the armrest (as shown in the lower image). However, without spherical harmonics, the highlight appears along the entire outer edge of the seat (upper image).

[Images: without spherical harmonics (upper), with spherical harmonics (lower)]

For the second camera pose, the highlight should fall on the seat, with the backrest entirely in shadow (lower image). Without spherical harmonics, however, there is some highlight on the backrest (upper image).

[Images: without spherical harmonics (upper), with spherical harmonics (lower)]

For the third camera pose, the entire seat should be in shadow (lower image). Yet, without spherical harmonics, a highlight is visible on the seat (upper image).

[Images: without spherical harmonics (upper), with spherical harmonics (lower)]

[GIFs: renderings without vs. with spherical harmonics]

1.3.2 Training On a Harder Scene¶

Baseline:

[GIFs: baseline results]

Tried changing:

  1. Hyperparameters: experimented with several sets of hyperparameters, for example:
     parameters = [
         {'params': [gaussians.pre_act_opacities], 'lr': 1e-4, "name": "opacities"},
         {'params': [gaussians.pre_act_scales], 'lr': 5e-5, "name": "scales"},
         {'params': [gaussians.colours], 'lr': 5e-5, "name": "colours"},
         {'params': [gaussians.means], 'lr': 5e-6, "name": "means"},
     ]
    
  2. Integrated an SSIM loss: new loss = l2_loss + 0.2 * ssim_loss
  3. Trained longer: 10000 iterations
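The combined loss from item 2 can be sketched as follows. `ssim_global` is a simplified, single-window SSIM used only for illustration; the real metric uses a sliding Gaussian window (e.g. from the `pytorch-msssim` package):

```python
import torch

def ssim_global(img1, img2, c1=0.01 ** 2, c2=0.03 ** 2):
    # Simplified SSIM computed over the whole image as one window;
    # the standard metric averages SSIM over sliding Gaussian windows.
    mu1, mu2 = img1.mean(), img2.mean()
    var1 = img1.var(unbiased=False)
    var2 = img2.var(unbiased=False)
    cov = ((img1 - mu1) * (img2 - mu2)).mean()
    return ((2 * mu1 * mu2 + c1) * (2 * cov + c2)) / (
        (mu1 ** 2 + mu2 ** 2 + c1) * (var1 + var2 + c2))

def combined_loss(pred, target, ssim_weight=0.2):
    l2 = ((pred - target) ** 2).mean()
    # SSIM is a similarity (1 = identical), so 1 - SSIM acts as a loss term.
    return l2 + ssim_weight * (1.0 - ssim_global(pred, target))

pred = torch.rand(3, 64, 64)
target = torch.rand(3, 64, 64)
loss = combined_loss(pred, target)
```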

New results (these changes did not noticeably improve quality):

[GIFs: results after the changes]

2. Diffusion-guided Optimization¶

2.1 SDS Loss + Image Optimization¶

"a hamburger"¶

Without (600 iters) vs. With Guidance (1500 iters):

[Images: without guidance vs. with guidance]

"a standing corgi dog"¶

Without (500 iters) vs. With Guidance (1500 iters):

[Images: without guidance vs. with guidance]

"a hello kitty"¶

Without (400 iters) vs. With Guidance (600 iters):

[Images: without guidance vs. with guidance]

"a steak and egg dish"¶

Without (800 iters) vs. With Guidance (900 iters):

[Images: without guidance vs. with guidance]

2.2 Texture Map Optimization for Mesh¶

"a standing black cow"¶

[GIF: rendered textured mesh]

"a standing orange cow"¶

[GIF: rendered textured mesh]

2.3 NeRF Optimization¶

"a standing corgi dog"¶

Depth (left), RGB (right).

[Videos: depth (left), RGB (right)]

"a cup of latte"¶

Depth (left), RGB (right).

[Videos: depth (left), RGB (right)]

"an apple"¶

Depth (left), RGB (right).

[Videos: depth (left), RGB (right)]

2.4 Extensions¶

2.4.1 View-dependent text embedding¶

"a standing corgi dog"¶

The previous result was already 3D-consistent even without view-dependent text embeddings (probably by luck). With view-dependent text embeddings, however, the result is clearly 3D-consistent: the orientation of the head and body looks clear, reasonable, and consistent across views.
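View-dependent text embedding can be approximated by appending a view suffix to the prompt based on the camera azimuth before encoding; the exact thresholds below are assumptions following the common DreamFusion-style convention:

```python
def view_dependent_prompt(prompt: str, azimuth_deg: float) -> str:
    """Append a view suffix chosen from the camera azimuth.

    Thresholds are assumed: front within +/-45 degrees of azimuth 0,
    back within +/-45 degrees of 180, side otherwise.
    """
    a = azimuth_deg % 360
    if a < 45 or a >= 315:
        suffix = "front view"
    elif a < 135 or a >= 225:
        suffix = "side view"
    else:
        suffix = "back view"
    return f"{prompt}, {suffix}"

print(view_dependent_prompt("a standing corgi dog", 180))
```

The modified prompt is then fed to the text encoder for that view's SDS step, so the diffusion prior is asked for a back view when the camera is behind the object, which is what removes the repeated-front-face failure mode.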

[Videos: without vs. with view-dependent text embedding]

"a cup of latte"¶

In this example, without view-dependent text embeddings, the geometry is clearly inconsistent across views: the handle appears multiple times in different places. This inconsistency arises because each view was previously optimized independently. With view-dependent text embeddings, overall 3D consistency is significantly improved (although the cup no longer has a handle in this example).

[Videos: without vs. with view-dependent text embedding]

2.4.3 Variation of implementation of SDS loss¶

In my implementation, I first add noise to the RGB image and then encode the noisy image into latent space. The UNet predicts the noise in this latent space, which is then subtracted from the latent representation to obtain a predicted latent. This predicted latent is finally decoded back into RGB space, allowing gradient computation using a combination of LPIPS and L2 losses in pixel space.

This implementation can result in higher image quality with finer visual details; however, it also incurs longer training times due to the larger dimensionality and higher computational cost.
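The data flow described above can be sketched as follows. The encoder, decoder, and UNet here are tiny placeholder modules standing in for the pretrained Stable Diffusion VAE and UNet, the noise level `alpha` stands in for the diffusion schedule, and the LPIPS term is omitted (only the L2 term is shown):

```python
import math
import torch
import torch.nn as nn

# Placeholder networks; the real pipeline uses the pretrained VAE and UNet.
encoder = nn.Conv2d(3, 4, kernel_size=8, stride=8)           # RGB -> latent
decoder = nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)  # latent -> RGB
unet = nn.Conv2d(4, 4, kernel_size=3, padding=1)             # predicts latent noise

image = torch.rand(1, 3, 64, 64, requires_grad=True)  # the image being optimized

# 1. Add noise to the RGB image (alpha would come from the noise schedule).
alpha = 0.7
noise = torch.randn_like(image)
noisy_rgb = math.sqrt(alpha) * image + math.sqrt(1 - alpha) * noise

# 2. Encode the noisy image into latent space.
latent = encoder(noisy_rgb)

# 3. Predict the noise in latent space and subtract it (simplified denoising).
pred_noise = unet(latent)
pred_latent = latent - pred_noise

# 4. Decode the predicted latent back to RGB; treat it as a fixed target
#    so the gradient reaches the image only through the pixel-space loss.
with torch.no_grad():
    target_rgb = decoder(pred_latent)

# 5. Pixel-space loss (L2 shown; the writeup additionally uses LPIPS).
loss = ((image - target_rgb) ** 2).mean()
loss.backward()
```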

I do not have concrete visual results for this approach because, in my experiments, the learned patterns often “smoothed out” into a nearly uniform background after approximately 10–20 iterations, probably due to the choice of hyperparameters.