Name: Ishita Gupta
Andrew ID: ishitag
Solution:

Implemented get_pixels_from_image and get_rays_from_pixels in ray_utils.py.
| Grid Visualization | Rays Visualization |
|---|---|
| ![]() | ![]() |
I first generate per-pixel NDC coordinates in [-1, 1]^2 using torch.meshgrid(..., indexing='ij'), stack them as (x, y), and reshape to (H*W, 2). I then form image-plane points (x, y, 1) in NDC, unproject them with the camera to world space, set all ray origins to the camera center, and define ray directions by normalizing (world_point - origin).
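A minimal sketch of this pipeline, assuming a PyTorch3D-style camera whose `unproject_points(..., from_ndc=True)` maps NDC image-plane points to world space (the exact camera API and function signatures in the starter code may differ):

```
import torch
import torch.nn.functional as F

def get_pixels_from_image_sketch(image_size):
    H, W = image_size
    # Per-pixel NDC coordinates in [-1, 1] along each axis
    ys = torch.linspace(-1.0, 1.0, H)
    xs = torch.linspace(-1.0, 1.0, W)
    y, x = torch.meshgrid(ys, xs, indexing="ij")
    # Stack as (x, y) and flatten to (H * W, 2)
    return torch.stack([x, y], dim=-1).reshape(-1, 2)

def get_rays_from_pixels_sketch(xy_ndc, camera):
    # Lift NDC pixels onto the image plane at depth 1: (x, y, 1)
    xy_depth = torch.cat([xy_ndc, torch.ones_like(xy_ndc[..., :1])], dim=-1)
    # Unproject the image-plane points to world space
    world_points = camera.unproject_points(xy_depth, from_ndc=True)
    # Every ray starts at the camera center
    origins = camera.get_camera_center().expand(world_points.shape)
    # Directions point from the camera center through each unprojected pixel
    directions = F.normalize(world_points - origins, dim=-1)
    return origins, directions
```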
Implemented StratifiedSampler.forward() in sampler.py.
Visualization:

I implement stratified sampling by partitioning the depth range [min_depth, max_depth] into n_pts_per_ray equal bins, computing midpoints, and adding uniform random offsets within each bin. The 3D sample points are then computed as sample_points = ray_origins + z_vals * ray_directions, where z_vals are the stratified depth values, producing a structured point cloud along all camera rays.
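A minimal sketch of the stratified sampler under these assumptions (argument names and shapes are illustrative, not the exact starter-code API; jittering from each bin's lower edge is equivalent to jittering around its midpoint):

```
import torch

def stratified_sample(origins, directions, min_depth, max_depth, n_pts_per_ray):
    # origins, directions: (N, 3); returns (N, n_pts_per_ray, 3) points and depths
    n_rays = origins.shape[0]
    # Partition [min_depth, max_depth] into n_pts_per_ray equal bins
    edges = torch.linspace(min_depth, max_depth, n_pts_per_ray + 1, device=origins.device)
    lower, upper = edges[:-1], edges[1:]
    # One uniform random offset per ray and per bin
    jitter = torch.rand(n_rays, n_pts_per_ray, device=origins.device)
    z_vals = lower + (upper - lower) * jitter                      # (N, n_pts_per_ray)
    # sample_points = ray_origins + z_vals * ray_directions
    sample_points = origins[:, None, :] + z_vals[..., None] * directions[:, None, :]
    return sample_points, z_vals
```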
Implemented:
- VolumeRenderer._compute_weights
- VolumeRenderer._aggregate
- VolumeRenderer.forward to render depth maps

| Spiral Rendering | Depth Map |
|---|---|
| ![]() | ![]() |
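The weight computation in `_compute_weights` and the aggregation in `_aggregate` follow the standard emission-absorption model; a hedged sketch (tensor shapes and argument names are assumptions, not the exact starter-code signatures):

```
import torch

def compute_weights(deltas, densities, eps=1e-10):
    # deltas: (N, S, 1) spacing between consecutive samples along each ray
    # densities: (N, S, 1) predicted sigma at each sample
    # Per-segment opacity: alpha_i = 1 - exp(-sigma_i * delta_i)
    alphas = 1.0 - torch.exp(-densities * deltas)
    # Transmittance T_i = prod_{j < i} (1 - alpha_j), via an exclusive cumulative product
    trans = torch.cumprod(1.0 - alphas + eps, dim=1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=1)
    return alphas * trans                        # (N, S, 1) per-sample weights

def aggregate(weights, features):
    # Weighted sum along the ray; features can be per-sample RGB or depth values
    return torch.sum(weights * features, dim=1)
```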
Implemented get_random_pixels_from_image in ray_utils.py.
```
H, W = image_size
# sample random pixel indices uniformly over the image
rand_y = torch.randint(0, H, (num_pixels,))
rand_x = torch.randint(0, W, (num_pixels,))
# convert the sampled (x, y) indices to NDC coordinates
xy_grid = convert_to_ndc(rand_x, rand_y, H, W)
```
Used Mean Squared Error (MSE) loss between predicted and ground truth RGB values.
After training:
- Box center: [0.25, 0.25, 0]
- Box side lengths: [2.00, 1.50, 1.50]

| Before Training | After Training |
|---|---|
| ![]() | ![]() |
| ![]() | ![]() |

I trained the box SDF model by randomly sampling rays from ground truth images and minimizing the MSE loss between predicted and ground truth RGB values. The model optimizes both the box center position and side lengths through gradient descent. Starting from an initial guess of a centered cube at origin with side lengths [1.5, 1.5, 1.5], the network discovered through differentiable volume rendering that the actual box is offset to (0.25, 0.25, 0) and is a rectangular prism elongated along the X-axis.
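A condensed sketch of one optimization step under these assumptions (`render_fn` and the parameter names are illustrative stand-ins for the differentiable renderer and the box parameters, not the exact training-loop code):

```
import torch

def fit_step(box_params, render_fn, ray_bundle, gt_rgb, optimizer):
    # box_params: learnable tensors, e.g. the box center and side lengths
    # render_fn: differentiable volume renderer returning predicted RGB per ray
    pred_rgb = render_fn(ray_bundle, box_params)          # (num_pixels, 3)
    # MSE between rendered and ground-truth colors; gradients flow back to the
    # box center and side lengths through the volume renderer
    loss = torch.mean((pred_rgb - gt_rgb) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```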
Implemented NeuralRadianceField class in implicit.py:
- Architecture: 6-layer MLP with 128 hidden units per layer
- Skip Connections: implemented at layer 3, concatenating the original 3D coordinates to help with high-frequency details
- Activations: ReLU for the density output, Sigmoid for the RGB output (see Output Processing below)
- Positional Encoding: HarmonicEmbedding with 6 harmonic functions for the 3D coordinates

Training Configuration:
Results:

NeRF Architecture Design: The implementation follows the original NeRF paper architecture with several key components:
Positional Encoding: Raw 3D coordinates are transformed using sinusoidal functions at multiple frequencies (2^0, 2^1, ..., 2^5). This encoding allows the MLP to represent high-frequency details that would be difficult to learn with raw coordinates alone.
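A small sketch of such an encoding, equivalent in spirit to the HarmonicEmbedding used here (the exact class in the starter code may scale the frequencies differently, e.g. by a factor of pi):

```
import torch

def harmonic_embedding(x, n_harmonics=6):
    # x: (..., 3) raw coordinates; returns (..., 3 * 2 * n_harmonics) features
    freqs = 2.0 ** torch.arange(n_harmonics, device=x.device)     # 2^0, 2^1, ..., 2^5
    scaled = x[..., None] * freqs                                  # (..., 3, n_harmonics)
    # sin and cos at every frequency, for every coordinate
    return torch.cat([scaled.sin(), scaled.cos()], dim=-1).flatten(-2)
```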
Deep MLP with Skip Connections: The 6-layer MLP with 128 hidden units provides sufficient capacity to represent complex 3D scenes. The skip connection at layer 3 concatenates the original 3D coordinates, providing a direct path for gradients and helping preserve high-frequency information.
Output Processing: The network outputs raw values that are processed with ReLU (density) and Sigmoid (RGB) to ensure physical constraints are met.
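A compact sketch of the MLP with a mid-network skip connection and the two output heads (layer count and widths follow the description above; here the skip re-injects the network input, and the module names are illustrative):

```
import torch
import torch.nn as nn

class NeRFMLPSketch(nn.Module):
    def __init__(self, embed_dim, hidden_dim=128, n_layers=6, skip_at=3):
        super().__init__()
        self.skip_at = skip_at
        layers = []
        for i in range(n_layers):
            in_dim = embed_dim if i == 0 else hidden_dim
            if i == skip_at:
                in_dim += embed_dim          # skip: re-inject the encoded coordinates
            layers.append(nn.Linear(in_dim, hidden_dim))
        self.layers = nn.ModuleList(layers)
        self.density_head = nn.Linear(hidden_dim, 1)
        self.rgb_head = nn.Linear(hidden_dim, 3)

    def forward(self, x_embed):
        h = x_embed
        for i, layer in enumerate(self.layers):
            if i == self.skip_at:
                h = torch.cat([h, x_embed], dim=-1)
            h = torch.relu(layer(h))
        # ReLU keeps density non-negative, sigmoid keeps RGB in [0, 1]
        return torch.relu(self.density_head(h)), torch.sigmoid(self.rgb_head(h))
```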
Results Quality: The trained NeRF successfully learns the 3D geometry and appearance of the lego bulldozer scene. The spiral rendering shows:
I added view dependence by implementing a two-head architecture: a view-independent density head that processes only 3D position features, and a view-dependent RGB head that concatenates position features with direction embeddings. The direction embeddings use harmonic encoding of normalized ray directions, which are expanded per sample point and fed into the RGB head alongside the position features to enable material appearance to vary with viewing angle.
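A hedged sketch of the two heads on top of the shared position features (feature and embedding dimensions are assumptions):

```
import torch
import torch.nn as nn

class ViewDependentHeads(nn.Module):
    def __init__(self, feat_dim=128, dir_embed_dim=24, hidden_dim=128):
        super().__init__()
        # Density depends only on position features (view-independent)
        self.density_head = nn.Linear(feat_dim, 1)
        # RGB depends on position features plus the embedded view direction
        self.rgb_head = nn.Sequential(
            nn.Linear(feat_dim + dir_embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 3),
        )

    def forward(self, position_feats, dir_embed):
        # position_feats: (N, S, feat_dim); dir_embed: (N, dir_embed_dim) per ray
        density = torch.relu(self.density_head(position_feats))
        # Expand the per-ray direction embedding to every sample along that ray
        dir_embed = dir_embed[:, None, :].expand(-1, position_feats.shape[1], -1)
        rgb = torch.sigmoid(self.rgb_head(torch.cat([position_feats, dir_embed], dim=-1)))
        return density, rgb
```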
Results:
| lego | materials | highres materials |
|---|---|---|
| ![]() | ![]() | ![]() |
I implemented hierarchical (coarse-to-fine) sampling as described in the original NeRF paper. This approach uses two networks: a smaller coarse network that samples uniformly along rays to estimate density, and a fine network that performs importance sampling based on the coarse predictions. By concentrating samples near surfaces, this method aims to improve rendering quality while maintaining computational efficiency. The implementation produces functional results, though with some training instability.
In more detail, the two-network approach from the original NeRF paper works as follows:
Coarse Network: First pass samples points uniformly along each ray and evaluates a smaller "coarse" NeRF network to get initial density estimates.
Importance Sampling: Use the coarse network's density predictions to compute a probability distribution along each ray, identifying regions likely to contain surfaces.
Fine Network: Sample additional points based on this importance distribution (denser sampling near surfaces) and evaluate the full "fine" network at both coarse and fine sample points.
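A sketch of the importance-sampling step that turns the coarse weights into new depth samples via inverse-CDF sampling (shapes and names are assumptions; here `weights` holds one weight per bin between consecutive coarse depths):

```
import torch

def sample_pdf(z_vals, weights, n_importance, eps=1e-5):
    # z_vals: (N, S) coarse sample depths; weights: (N, S - 1) coarse weight per bin
    pdf = (weights + eps) / torch.sum(weights + eps, dim=-1, keepdim=True)
    cdf = torch.cumsum(pdf, dim=-1)
    cdf = torch.cat([torch.zeros_like(cdf[..., :1]), cdf], dim=-1)   # (N, S)

    # Invert the CDF at uniformly drawn points: more samples where weights are large
    u = torch.rand(cdf.shape[0], n_importance, device=cdf.device)
    idx = torch.searchsorted(cdf, u, right=True).clamp(1, cdf.shape[-1] - 1)

    # Linearly interpolate a depth inside the selected bin
    cdf_lo, cdf_hi = torch.gather(cdf, -1, idx - 1), torch.gather(cdf, -1, idx)
    z_lo, z_hi = torch.gather(z_vals, -1, idx - 1), torch.gather(z_vals, -1, idx)
    t = (u - cdf_lo) / (cdf_hi - cdf_lo + eps)
    return z_lo + t * (z_hi - z_lo)              # (N, n_importance) new depths
```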
I created a CoarseNeRF class with a smaller architecture (half the hidden units and layers) and modified the training loop to:
Results:
The hierarchical sampling implementation produced partial results. While the training was somewhat unstable initially, it did generate renderings:

The results show that the hierarchical sampler is functional but not fully tuned. The rendering quality is acceptable, though not on par with the earlier single-network results, indicating that the coarse-to-fine sampling strategy works to some degree.
Speed/Quality Trade-offs:
Challenges encountered:
Implemented sphere_tracing function in renderer.py:
I implemented sphere tracing by marching along each ray in steps equal to the SDF value at the current point. Starting at the near plane, I normalize directions, iteratively update points, and stop when |SDF| < epsilon (hit) or the accumulated distance exceeds the far plane (miss). The function returns the final points and a boolean mask indicating which rays intersected the torus surface.
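A condensed sketch of this marching loop (the fixed iteration cap, `sdf_fn` signature, and tensor shapes are assumptions):

```
import torch

def sphere_tracing(sdf_fn, origins, directions, near, far, max_iters=64, eps=1e-5):
    # origins: (N, 3); directions: (N, 3), assumed normalized
    t = torch.full_like(origins[..., :1], near)      # distance marched per ray
    points = origins + t * directions
    mask = torch.zeros_like(t, dtype=torch.bool)     # which rays have hit the surface

    for _ in range(max_iters):
        dist = sdf_fn(points)                        # (N, 1) signed distance
        mask = mask | ((dist.abs() < eps) & (t < far))
        # Rays that have hit, or marched past the far plane, stop advancing;
        # active rays step forward by the SDF value (a safe step size)
        step = torch.where(mask | (t >= far), torch.zeros_like(dist), dist)
        t = t + step
        points = origins + t * directions

    return points, mask.squeeze(-1)                  # final points and hit mask
```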
Results:

I implement a dual-branch architecture with positional encoding: a 6-layer distance MLP (128 hidden units) with ReLU activations and optional skip connections for SDF prediction, and a separate 2-layer color MLP (128 hidden units) for RGB output. The distance head outputs raw signed distances (no activation), while the color head uses sigmoid to ensure 0-1 RGB range. Both branches use harmonic positional encoding (4 frequencies) on 3D coordinates to improve representation quality.
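A trimmed-down sketch of the two branches (shown shallower than the 6-layer distance MLP described above, and without the optional skip connections, to keep it short; names are illustrative):

```
import torch.nn as nn

class NeuralSDFSketch(nn.Module):
    def __init__(self, embed_dim, hidden_dim=128):
        super().__init__()
        # Distance branch: raw signed distance, no output activation
        self.distance_mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )
        # Color branch: 2 layers, sigmoid keeps RGB in [0, 1]
        self.color_mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 3), nn.Sigmoid(),
        )

    def forward(self, x_embed):
        # x_embed: harmonically encoded 3D coordinates
        return self.distance_mlp(x_embed), self.color_mlp(x_embed)
```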
Eikonal Loss: Implemented in losses.py
```
# gradients: (N, 3) spatial gradients of the SDF at the sampled points
grad_norm = torch.norm(gradients, dim=-1)
# penalize deviation of the gradient norm from 1 (the eikonal property of an SDF)
eikonal_loss = torch.mean((grad_norm - 1.0) ** 2)
```
Results:
| Input Point Cloud | Reconstructed Surface |
|---|---|
| ![]() | ![]() |
Extended Neural SDF from Q6 with:
- Color Network: 2-layer MLP (128 hidden units) with positional encoding and sigmoid activation for RGB output in [0, 1].
- SDF to Density: VolSDF Laplace CDF conversion, density = alpha * Psi_beta(-sdf), where Psi_beta is the Laplace CDF; the density is high near the surface (sdf ≈ 0) and decays exponentially away from it (a sketch follows this list).
- Networks: 6-layer distance MLP (128 units), 2-layer color MLP (128 units), 6 harmonic frequencies.
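A sketch of the Laplace-CDF conversion described above (the alpha and beta values follow the ones reported later; Psi_beta is the CDF of a zero-mean Laplace distribution with scale beta):

```
import torch

def sdf_to_density_volsdf(sdf, alpha=10.0, beta=0.05):
    # Psi_beta(-sdf): Laplace CDF evaluated at the negated signed distance,
    # so density approaches alpha inside the surface and decays to 0 outside
    s = -sdf
    psi = torch.where(
        s <= 0,
        0.5 * torch.exp(s / beta),
        1.0 - 0.5 * torch.exp(-s / beta),
    )
    return alpha * psi
```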
| Geometry | Rendered Color |
|---|---|
| ![]() | ![]() |
Alpha and Beta intuition:
How does high beta bias your learned SDF? What about low beta?:
Would an SDF be easier to train with volume rendering and low beta or high beta? Why?:
Would you be more likely to learn an accurate surface with high beta or low beta? Why?:
I created a complex scene with 36 primitives arranged in an inverted cone structure (like a Christmas tree) with toruses on top. The scene combines the primitives with SDF union operations, taking the pointwise minimum of the individual SDFs.
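The union reduces to a pointwise minimum over the primitives' SDF values; a minimal sketch:

```
import torch

def sdf_union(*sdf_values):
    # Each argument is an (N, 1) tensor of signed distances to one primitive;
    # the union of the solids is the elementwise minimum of those distances
    return torch.stack(sdf_values, dim=0).min(dim=0).values
```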
Command:
```
python -m surface_rendering_main --config-name=complex_scene
```
Results:

"Come one, cheer up, it's nearly Christmas."
— Hagrid
| nerf | volsdf | volsdf geometry |
|---|---|---|
| ![]() | ![]() | ![]() |
Trained both NeRF and VolSDF on only 20 views (vs. the standard 100). VolSDF uses stronger regularization (5x eikonal weight = 0.1, 2x interior weight = 0.2) and longer pretraining (2000 iters) to compensate for the sparse data. The SDF-based representation, with its geometric prior that ||∇f|| = 1 (the eikonal constraint), produces more consistent geometry in unobserved regions than NeRF, which tends to overfit or produce artifacts with limited views.
Implemented three SDF-to-density conversion methods in renderer.py:
- VolSDF (Laplace CDF) - original method using the Laplace cumulative distribution:
  sigma(x) = alpha * Psi_beta(-f(x)), where Psi_beta is the Laplace CDF
- NeuS (Logistic Density) - uses the derivative of the sigmoid function:
  sigma(x) = alpha * (1/beta) * exp(-|f(x)|/beta) / (1 + exp(-|f(x)|/beta))^2
- Naive (Simple Exponential) - basic exponential decay:
  sigma(x) = alpha * exp(-|f(x)|/beta)

| VolSDF | NeuS | Naive |
|---|---|---|
| ![]() | ![]() | ![]() |
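Sketches of the NeuS and naive conversions as written above, using the alpha/beta values reported below (the VolSDF version is sketched earlier; function names are illustrative):

```
import torch

def sdf_to_density_neus(sdf, alpha=50.0, beta=0.1):
    # Logistic density (derivative of the sigmoid), sharply peaked at the surface
    e = torch.exp(-sdf.abs() / beta)
    return alpha * (1.0 / beta) * e / (1.0 + e) ** 2

def sdf_to_density_naive(sdf, alpha=20.0, beta=0.1):
    # Simple exponential decay with distance from the surface (no inside/outside)
    return alpha * torch.exp(-sdf.abs() / beta)
```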
VolSDF (Baseline): Successfully renders the scene with good surface detail (mean brightness=10.05). The Laplace CDF with asymmetric inside/outside handling produces smooth, well-defined surfaces at alpha=10.0, beta=0.05.
NeuS: Produces very dark, near-black output; this was a failure case. The logistic density with alpha=50.0, beta=0.1 may have concentrated density too sharply, causing numerical issues or requiring significantly more training, which suggests NeuS is more sensitive to hyperparameters than the other two conversions.
Naive: Surprisingly achieves comparable quality to VolSDF. The simple exponential decay (alpha=20.0, beta=0.1) successfully learns the scene despite lacking inside/outside distinction. However, this approach may struggle with more complex geometries where surface orientation matters.