Name: Ishita Gupta
Andrew ID: ishitag
Solution:

Implemented get_pixels_from_image and get_rays_from_pixels in ray_utils.py.
| Grid Visualization | Rays Visualization |
|---|---|
| ![]() | ![]() |
I first generate per-pixel NDC coordinates in [-1, 1]^2 using torch.meshgrid(..., indexing='ij'), stack them as (x, y), and reshape to (H*W, 2). I then form image-plane points (x, y, 1) in NDC, unproject them with the camera to world space, set all ray origins to the camera center, and define ray directions by normalizing (world_point - origin).
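A minimal sketch of this pipeline, assuming a PyTorch3D-style camera whose `unproject_points(..., from_ndc=True)` maps NDC image-plane points to world space (the exact camera API and function signatures in the starter code may differ):

```
import torch
import torch.nn.functional as F

def get_pixels_from_image_sketch(image_size):
    H, W = image_size
    # Per-pixel NDC coordinates in [-1, 1] along each axis
    ys = torch.linspace(-1.0, 1.0, H)
    xs = torch.linspace(-1.0, 1.0, W)
    y, x = torch.meshgrid(ys, xs, indexing="ij")
    # Stack as (x, y) and flatten to (H * W, 2)
    return torch.stack([x, y], dim=-1).reshape(-1, 2)

def get_rays_from_pixels_sketch(xy_ndc, camera):
    # Lift NDC pixels onto the image plane at depth 1: (x, y, 1)
    xy_depth = torch.cat([xy_ndc, torch.ones_like(xy_ndc[..., :1])], dim=-1)
    # Unproject the image-plane points to world space
    world_points = camera.unproject_points(xy_depth, from_ndc=True)
    # Every ray starts at the camera center
    origins = camera.get_camera_center().expand(world_points.shape)
    # Directions point from the camera center through each unprojected pixel
    directions = F.normalize(world_points - origins, dim=-1)
    return origins, directions
```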
Implemented StratifiedSampler.forward() in sampler.py.
Visualization:

I implement stratified sampling by partitioning the depth range [min_depth, max_depth] into n_pts_per_ray equal bins, computing midpoints, and adding uniform random offsets within each bin. The 3D sample points are then computed as sample_points = ray_origins + z_vals * ray_directions, where z_vals are the stratified depth values, producing a structured point cloud along all camera rays.
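A minimal sketch of the stratified sampler under these assumptions (argument names and shapes are illustrative, not the exact starter-code API; jittering from each bin's lower edge is equivalent to jittering around its midpoint):

```
import torch

def stratified_sample(origins, directions, min_depth, max_depth, n_pts_per_ray):
    # origins, directions: (N, 3); returns (N, n_pts_per_ray, 3) points and depths
    n_rays = origins.shape[0]
    # Partition [min_depth, max_depth] into n_pts_per_ray equal bins
    edges = torch.linspace(min_depth, max_depth, n_pts_per_ray + 1, device=origins.device)
    lower, upper = edges[:-1], edges[1:]
    # One uniform random offset per ray and per bin
    jitter = torch.rand(n_rays, n_pts_per_ray, device=origins.device)
    z_vals = lower + (upper - lower) * jitter                      # (N, n_pts_per_ray)
    # sample_points = ray_origins + z_vals * ray_directions
    sample_points = origins[:, None, :] + z_vals[..., None] * directions[:, None, :]
    return sample_points, z_vals
```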
Implemented:
- VolumeRenderer._compute_weights
- VolumeRenderer._aggregate
- VolumeRenderer.forward to render depth maps

| Spiral Rendering | Depth Map |
|---|---|
| ![]() | ![]() |
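The weight computation in `_compute_weights` and the aggregation in `_aggregate` follow the standard emission-absorption model; a hedged sketch (tensor shapes and argument names are assumptions, not the exact starter-code signatures):

```
import torch

def compute_weights(deltas, densities, eps=1e-10):
    # deltas: (N, S, 1) spacing between consecutive samples along each ray
    # densities: (N, S, 1) predicted sigma at each sample
    # Per-segment opacity: alpha_i = 1 - exp(-sigma_i * delta_i)
    alphas = 1.0 - torch.exp(-densities * deltas)
    # Transmittance T_i = prod_{j < i} (1 - alpha_j), via an exclusive cumulative product
    trans = torch.cumprod(1.0 - alphas + eps, dim=1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=1)
    return alphas * trans                        # (N, S, 1) per-sample weights

def aggregate(weights, features):
    # Weighted sum along the ray; features can be per-sample RGB or depth values
    return torch.sum(weights * features, dim=1)
```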
Implemented get_random_pixels_from_image in ray_utils.py.
```
H, W = image_size
# sample random pixel indices uniformly over the image
rand_y = torch.randint(0, H, (num_pixels,))
rand_x = torch.randint(0, W, (num_pixels,))
# convert the sampled (x, y) indices to NDC coordinates
xy_grid = convert_to_ndc(rand_x, rand_y, H, W)
```
Used Mean Squared Error (MSE) loss between predicted and ground truth RGB values.
After training:
- Box center: [0.25, 0.25, 0]
- Box side lengths: [2.00, 1.50, 1.50]

| Before Training | After Training |
|---|---|
| ![]() | ![]() |
| ![]() | ![]() |

I trained the box SDF model by randomly sampling rays from ground truth images and minimizing the MSE loss between predicted and ground truth RGB values. The model optimizes both the box center position and side lengths through gradient descent. Starting from an initial guess of a centered cube at origin with side lengths [1.5, 1.5, 1.5], the network discovered through differentiable volume rendering that the actual box is offset to (0.25, 0.25, 0) and is a rectangular prism elongated along the X-axis.
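A condensed sketch of one optimization step under these assumptions (`render_fn` and the parameter names are illustrative stand-ins for the differentiable renderer and the box parameters, not the exact training-loop code):

```
import torch

def fit_step(box_params, render_fn, ray_bundle, gt_rgb, optimizer):
    # box_params: learnable tensors, e.g. the box center and side lengths
    # render_fn: differentiable volume renderer returning predicted RGB per ray
    pred_rgb = render_fn(ray_bundle, box_params)          # (num_pixels, 3)
    # MSE between rendered and ground-truth colors; gradients flow back to the
    # box center and side lengths through the volume renderer
    loss = torch.mean((pred_rgb - gt_rgb) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```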
Implemented NeuralRadianceField class in implicit.py:
- Architecture: 6-layer MLP with 128 hidden units per layer
- Skip Connections: implemented at layer 3, concatenating the original 3D coordinates to help with high-frequency details
- Activations: ReLU for the density output, Sigmoid for the RGB output (see Output Processing below)
- Positional Encoding: HarmonicEmbedding with 6 harmonic functions for the 3D coordinates

Training Configuration:
Results:

NeRF Architecture Design: The implementation follows the original NeRF paper architecture with several key components:
Positional Encoding: Raw 3D coordinates are transformed using sinusoidal functions at multiple frequencies (2^0, 2^1, ..., 2^5). This encoding allows the MLP to represent high-frequency details that would be difficult to learn with raw coordinates alone.
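A small sketch of such an encoding, equivalent in spirit to the HarmonicEmbedding used here (the exact class in the starter code may scale the frequencies differently, e.g. by a factor of pi):

```
import torch

def harmonic_embedding(x, n_harmonics=6):
    # x: (..., 3) raw coordinates; returns (..., 3 * 2 * n_harmonics) features
    freqs = 2.0 ** torch.arange(n_harmonics, device=x.device)     # 2^0, 2^1, ..., 2^5
    scaled = x[..., None] * freqs                                  # (..., 3, n_harmonics)
    # sin and cos at every frequency, for every coordinate
    return torch.cat([scaled.sin(), scaled.cos()], dim=-1).flatten(-2)
```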
Deep MLP with Skip Connections: The 6-layer MLP with 128 hidden units provides sufficient capacity to represent complex 3D scenes. The skip connection at layer 3 concatenates the original 3D coordinates, providing a direct path for gradients and helping preserve high-frequency information.
Output Processing: The network outputs raw values that are processed with ReLU (density) and Sigmoid (RGB) to ensure physical constraints are met.
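A compact sketch of the MLP with a mid-network skip connection and the two output heads (layer count and widths follow the description above; here the skip re-injects the network input, and the module names are illustrative):

```
import torch
import torch.nn as nn

class NeRFMLPSketch(nn.Module):
    def __init__(self, embed_dim, hidden_dim=128, n_layers=6, skip_at=3):
        super().__init__()
        self.skip_at = skip_at
        layers = []
        for i in range(n_layers):
            in_dim = embed_dim if i == 0 else hidden_dim
            if i == skip_at:
                in_dim += embed_dim          # skip: re-inject the encoded coordinates
            layers.append(nn.Linear(in_dim, hidden_dim))
        self.layers = nn.ModuleList(layers)
        self.density_head = nn.Linear(hidden_dim, 1)
        self.rgb_head = nn.Linear(hidden_dim, 3)

    def forward(self, x_embed):
        h = x_embed
        for i, layer in enumerate(self.layers):
            if i == self.skip_at:
                h = torch.cat([h, x_embed], dim=-1)
            h = torch.relu(layer(h))
        # ReLU keeps density non-negative, sigmoid keeps RGB in [0, 1]
        return torch.relu(self.density_head(h)), torch.sigmoid(self.rgb_head(h))
```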
Results Quality: The trained NeRF successfully learns the 3D geometry and appearance of the lego bulldozer scene. The spiral rendering shows:
I added view dependence by implementing a two-head architecture: a view-independent density head that processes only 3D position features, and a view-dependent RGB head that concatenates position features with direction embeddings. The direction embeddings use harmonic encoding of normalized ray directions, which are expanded per sample point and fed into the RGB head alongside the position features to enable material appearance to vary with viewing angle.
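A hedged sketch of the two heads on top of the shared position features (feature and embedding dimensions are assumptions):

```
import torch
import torch.nn as nn

class ViewDependentHeads(nn.Module):
    def __init__(self, feat_dim=128, dir_embed_dim=24, hidden_dim=128):
        super().__init__()
        # Density depends only on position features (view-independent)
        self.density_head = nn.Linear(feat_dim, 1)
        # RGB depends on position features plus the embedded view direction
        self.rgb_head = nn.Sequential(
            nn.Linear(feat_dim + dir_embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 3),
        )

    def forward(self, position_feats, dir_embed):
        # position_feats: (N, S, feat_dim); dir_embed: (N, dir_embed_dim) per ray
        density = torch.relu(self.density_head(position_feats))
        # Expand the per-ray direction embedding to every sample along that ray
        dir_embed = dir_embed[:, None, :].expand(-1, position_feats.shape[1], -1)
        rgb = torch.sigmoid(self.rgb_head(torch.cat([position_feats, dir_embed], dim=-1)))
        return density, rgb
```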
Results:
| lego | materials | highres materials |
|---|---|---|
| ![]() | ![]() | ![]() |
I implemented hierarchical (coarse-to-fine) sampling as described in the original NeRF paper. This approach uses two networks: a smaller coarse network that samples uniformly along rays to estimate density, and a fine network that performs importance sampling based on the coarse predictions. By concentrating samples near surfaces, this method aims to improve rendering quality while maintaining computational efficiency. The implementation produces functional results, though with some training instability.
In more detail, the two-network approach from the original NeRF paper works as follows:
Coarse Network: First pass samples points uniformly along each ray and evaluates a smaller "coarse" NeRF network to get initial density estimates.
Importance Sampling: Use the coarse network's density predictions to compute a probability distribution along each ray, identifying regions likely to contain surfaces.
Fine Network: Sample additional points based on this importance distribution (denser sampling near surfaces) and evaluate the full "fine" network at both coarse and fine sample points.
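A sketch of the importance-sampling step that turns the coarse weights into new depth samples via inverse-CDF sampling (shapes and names are assumptions; here `weights` holds one weight per bin between consecutive coarse depths):

```
import torch

def sample_pdf(z_vals, weights, n_importance, eps=1e-5):
    # z_vals: (N, S) coarse sample depths; weights: (N, S - 1) coarse weight per bin
    pdf = (weights + eps) / torch.sum(weights + eps, dim=-1, keepdim=True)
    cdf = torch.cumsum(pdf, dim=-1)
    cdf = torch.cat([torch.zeros_like(cdf[..., :1]), cdf], dim=-1)   # (N, S)

    # Invert the CDF at uniformly drawn points: more samples where weights are large
    u = torch.rand(cdf.shape[0], n_importance, device=cdf.device)
    idx = torch.searchsorted(cdf, u, right=True).clamp(1, cdf.shape[-1] - 1)

    # Linearly interpolate a depth inside the selected bin
    cdf_lo, cdf_hi = torch.gather(cdf, -1, idx - 1), torch.gather(cdf, -1, idx)
    z_lo, z_hi = torch.gather(z_vals, -1, idx - 1), torch.gather(z_vals, -1, idx)
    t = (u - cdf_lo) / (cdf_hi - cdf_lo + eps)
    return z_lo + t * (z_hi - z_lo)              # (N, n_importance) new depths
```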
I created a CoarseNeRF class with a smaller architecture (half the hidden units and layers) and modified the training loop to:
Results:
The hierarchical sampling implementation produced partial results. While the training was somewhat unstable initially, it did generate renderings:

The results show that the hierarchical sampler is functional but not fully tuned. The rendering quality is acceptable, though not on par with the earlier single-network results, indicating that the coarse-to-fine sampling strategy works to some degree.
Speed/Quality Trade-offs:
Challenges encountered:
Implemented sphere_tracing function in renderer.py:
I implemented sphere tracing by marching along each ray in steps equal to the SDF value at the current point. Starting at the near plane, I normalize directions, iteratively update points, and stop when |SDF| < epsilon (hit) or the accumulated distance exceeds the far plane (miss). The function returns the final points and a boolean mask indicating which rays intersected the torus surface.
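A condensed sketch of this marching loop (the fixed iteration cap, `sdf_fn` signature, and tensor shapes are assumptions):

```
import torch

def sphere_tracing(sdf_fn, origins, directions, near, far, max_iters=64, eps=1e-5):
    # origins: (N, 3); directions: (N, 3), assumed normalized
    t = torch.full_like(origins[..., :1], near)      # distance marched per ray
    points = origins + t * directions
    mask = torch.zeros_like(t, dtype=torch.bool)     # which rays have hit the surface

    for _ in range(max_iters):
        dist = sdf_fn(points)                        # (N, 1) signed distance
        mask = mask | ((dist.abs() < eps) & (t < far))
        # Rays that have hit, or marched past the far plane, stop advancing;
        # active rays step forward by the SDF value (a safe step size)
        step = torch.where(mask | (t >= far), torch.zeros_like(dist), dist)
        t = t + step
        points = origins + t * directions

    return points, mask.squeeze(-1)                  # final points and hit mask
```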
Results:

I implement a dual-branch architecture with positional encoding: a 6-layer distance MLP (128 hidden units) with ReLU activations and optional skip connections for SDF prediction, and a separate 2-layer color MLP (128 hidden units) for RGB output. The distance head outputs raw signed distances (no activation), while the color head uses sigmoid to ensure 0-1 RGB range. Both branches use harmonic positional encoding (4 frequencies) on 3D coordinates to improve representation quality.
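A trimmed-down sketch of the two branches (shown shallower than the 6-layer distance MLP described above, and without the optional skip connections, to keep it short; names are illustrative):

```
import torch.nn as nn

class NeuralSDFSketch(nn.Module):
    def __init__(self, embed_dim, hidden_dim=128):
        super().__init__()
        # Distance branch: raw signed distance, no output activation
        self.distance_mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )
        # Color branch: 2 layers, sigmoid keeps RGB in [0, 1]
        self.color_mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 3), nn.Sigmoid(),
        )

    def forward(self, x_embed):
        # x_embed: harmonically encoded 3D coordinates
        return self.distance_mlp(x_embed), self.color_mlp(x_embed)
```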
Eikonal Loss: Implemented in losses.py
```
# gradients: (N, 3) spatial gradients of the SDF at the sampled points
grad_norm = torch.norm(gradients, dim=-1)
# penalize deviation of the gradient norm from 1 (the eikonal property of an SDF)
eikonal_loss = torch.mean((grad_norm - 1.0) ** 2)
```
Results:
| Input Point Cloud | Reconstructed Surface |
|---|---|
| ![]() | ![]() |
Extended Neural SDF from Q6 with:
- Color Network: 2-layer MLP (128 hidden units) with positional encoding and sigmoid activation for RGB output in [0, 1].
- SDF to Density: VolSDF Laplace CDF conversion, density = alpha * Psi_beta(-sdf), where Psi_beta is the Laplace CDF; the density is high near the surface (sdf ≈ 0) and decays exponentially away from it (a sketch follows this list).
- Networks: 6-layer distance MLP (128 units), 2-layer color MLP (128 units), 6 harmonic frequencies.
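A sketch of the Laplace-CDF conversion described above (the alpha and beta values follow the ones reported later; Psi_beta is the CDF of a zero-mean Laplace distribution with scale beta):

```
import torch

def sdf_to_density_volsdf(sdf, alpha=10.0, beta=0.05):
    # Psi_beta(-sdf): Laplace CDF evaluated at the negated signed distance,
    # so density approaches alpha inside the surface and decays to 0 outside
    s = -sdf
    psi = torch.where(
        s <= 0,
        0.5 * torch.exp(s / beta),
        1.0 - 0.5 * torch.exp(-s / beta),
    )
    return alpha * psi
```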
| Geometry | Rendered Color |
|---|---|
| ![]() | ![]() |
Alpha and Beta intuition:
How does high beta bias your learned SDF? What about low beta?:
Would an SDF be easier to train with volume rendering and low beta or high beta? Why?:
Would you be more likely to learn an accurate surface with high beta or low beta? Why?:
I created a complex scene with 36 primitives arranged in an inverted cone structure (like a Christmas tree) with toruses on top. The scene combines the primitives with SDF union operations, taking the pointwise minimum of the individual SDFs.
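The union reduces to a pointwise minimum over the primitives' SDF values; a minimal sketch:

```
import torch

def sdf_union(*sdf_values):
    # Each argument is an (N, 1) tensor of signed distances to one primitive;
    # the union of the solids is the elementwise minimum of those distances
    return torch.stack(sdf_values, dim=0).min(dim=0).values
```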
Command:
```
python -m surface_rendering_main --config-name=complex_scene
```
Results:

"Come one, cheer up, it's nearly Christmas."
— Hagrid
| nerf | volsdf | volsdf geometry |
|---|---|---|
| ![]() | ![]() | ![]() |
Trained both NeRF and VolSDF on only 20 views (vs. the standard 100). VolSDF uses stronger regularization (5x eikonal weight = 0.1, 2x interior weight = 0.2) and longer pretraining (2000 iters) to compensate for the sparse data. The SDF-based representation, with its geometric prior that ||∇f|| = 1 (the eikonal constraint), produces more consistent geometry in unobserved regions than NeRF, which tends to overfit or produce artifacts with limited views.
Implemented three SDF-to-density conversion methods in renderer.py:
- VolSDF (Laplace CDF) - original method using the Laplace cumulative distribution:
  sigma(x) = alpha * Psi_beta(-f(x)), where Psi_beta is the Laplace CDF
- NeuS (Logistic Density) - uses the derivative of the sigmoid function:
  sigma(x) = alpha * (1/beta) * exp(-|f(x)|/beta) / (1 + exp(-|f(x)|/beta))^2
- Naive (Simple Exponential) - basic exponential decay:
  sigma(x) = alpha * exp(-|f(x)|/beta)

| VolSDF | NeuS | Naive |
|---|---|---|
| ![]() | ![]() | ![]() |
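Sketches of the NeuS and naive conversions as written above, using the alpha/beta values reported below (the VolSDF version is sketched earlier; function names are illustrative):

```
import torch

def sdf_to_density_neus(sdf, alpha=50.0, beta=0.1):
    # Logistic density (derivative of the sigmoid), sharply peaked at the surface
    e = torch.exp(-sdf.abs() / beta)
    return alpha * (1.0 / beta) * e / (1.0 + e) ** 2

def sdf_to_density_naive(sdf, alpha=20.0, beta=0.1):
    # Simple exponential decay with distance from the surface (no inside/outside)
    return alpha * torch.exp(-sdf.abs() / beta)
```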
VolSDF (Baseline): Successfully renders the scene with good surface detail (mean brightness=10.05). The Laplace CDF with asymmetric inside/outside handling produces smooth, well-defined surfaces at alpha=10.0, beta=0.05.
NeuS: Produces very dark, near-black output; this was a failure case. The logistic density with alpha=50.0, beta=0.1 may have concentrated density too sharply, causing numerical issues or requiring significantly more training, which suggests NeuS is more sensitive to hyperparameters than the other two conversions.
Naive: Surprisingly achieves comparable quality to VolSDF. The simple exponential decay (alpha=20.0, beta=0.1) successfully learns the scene despite lacking inside/outside distinction. However, this approach may struggle with more complex geometries where surface orientation matters.