Neural Volume Rendering and Surface Rendering¶
Part A - Neural Volume Rendering¶
0 - Transmittance Calculation¶
Below, I work through the transmittance calculation for a given example of a ray passing through a non-homogeneous medium. The following transmittance formula will be used by our Neural Volume Rendering set-up later.
These are the transmittance values we are trying to calculate:
- T(y1,y2)
- T(y2,y4)
- T(x,y4)
- T(x,y3)
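For a piecewise-constant medium, the transmittance between two points along the ray reduces to $T = \exp\big(-\sum_i \sigma_i \, \Delta_i\big)$, and transmittance composes multiplicatively (e.g. $T(x, y_3) = T(x, y_2)\,T(y_2, y_3)$). Below is a quick numerical sanity check of this with made-up placeholder densities and segment lengths (not the actual values from the example figure):

```python
import numpy as np

def transmittance(sigmas, lengths):
    """T = exp(-sum_i sigma_i * delta_i) for a piecewise-constant medium."""
    return np.exp(-np.sum(np.asarray(sigmas) * np.asarray(lengths)))

# Hypothetical per-segment densities and segment lengths (placeholder values)
sigmas = [0.5, 1.0]
lengths = [2.0, 1.0]
print(transmittance(sigmas, lengths))  # exp(-(0.5*2 + 1.0*1)) = exp(-2)
```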
1.3 - Ray Sampling¶
In this part, I implemented get_pixels_from_image and get_rays_from_pixels as utilities for the volume and surface rendering that will be implemented later. These functions essentially generate pixel coordinates in the range [-1, 1] for the image and then generate rays for each pixel. To do this, we unproject from the camera's Normalized Device Coordinate Space into world space, and get the origins/directions for each ray.
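A minimal sketch of how these two utilities might look, assuming a PyTorch3D-style camera; the exact signatures and NDC/depth conventions in the starter code may differ:

```python
import torch

def get_pixels_from_image(image_size, device="cpu"):
    """NDC-style pixel grid with x, y in [-1, 1], returned as (H*W, 2)."""
    W, H = image_size
    x = torch.linspace(-1.0, 1.0, W, device=device)
    y = torch.linspace(-1.0, 1.0, H, device=device)
    yy, xx = torch.meshgrid(y, x, indexing="ij")
    return torch.stack([xx, yy], dim=-1).reshape(-1, 2)

def get_rays_from_pixels(xy_grid, camera):
    """Unproject NDC pixels into world space to get per-pixel ray origins/directions."""
    # Put every pixel on a plane at depth 1 and unproject back to world space
    # (the exact unprojection flags depend on the camera convention used)
    xy_depth = torch.cat([xy_grid, torch.ones_like(xy_grid[..., :1])], dim=-1)
    world_points = camera.unproject_points(xy_depth, world_coordinates=True)
    origins = camera.get_camera_center().expand(world_points.shape)
    directions = torch.nn.functional.normalize(world_points - origins, dim=-1)
    return origins, directions
```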
Below are visualized examples of the xy_grid (i.e. pixel coordinate grid) and the rays for a box:
| xy_grid | rays |
|---|---|
| ![]() | ![]() |
1.4 - Point Sampling¶
In this section, I implemented a sampler that will uniformly sample N point offsets from the origin of a ray along its direction. These sampled points are what will eventually be used to render color along the ray. We can visualize this and see what the point samples look like from the first camera:
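A minimal sketch of such a uniform sampler, assuming rays are given as `(n_rays, 3)` origin/direction tensors and near/far bounds (names are illustrative, not the exact starter-code API):

```python
import torch

def sample_points_along_rays(origins, directions, n_pts, near, far):
    """Uniformly sample n_pts depths in [near, far] and place points along each ray."""
    # (n_pts,) depth values shared across all rays
    z_vals = torch.linspace(near, far, n_pts, device=origins.device)
    # (n_rays, n_pts, 3): origin + t * direction for each sampled depth
    points = origins[:, None, :] + z_vals[None, :, None] * directions[:, None, :]
    return points, z_vals.expand(origins.shape[0], n_pts)
```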
1.5 - Volume Rendering¶
In this part, I implemented a Volume Renderer that builds on the previous few sections (and will eventually be used by our Volume Optimization and NeRF systems). Given the volume density and the deltas (i.e. distances between consecutive samples) for some SDF Volume, it computes the weights that the radiance summation uses to scale the emitted radiance at the sampled points:
Basically, these are our weights:
$$ T(\mathbf{x}, \mathbf{x}_{t_i}) \, \big(1 - e^{-\sigma_{t_i} \, \Delta t}\big) $$
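A minimal sketch of how these weights could be computed from per-sample densities and deltas; the transmittance up to each sample is an exclusive cumulative product along the ray (tensor shapes are assumptions):

```python
import torch

def compute_weights(sigmas, deltas, eps=1e-10):
    """sigmas, deltas: (n_rays, n_samples, 1) -> weights of the same shape."""
    # Probability that light is absorbed within each segment
    alpha = 1.0 - torch.exp(-sigmas * deltas)
    # Transmittance T_i = prod_{j < i} exp(-sigma_j * delta_j) (exclusive cumprod)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + eps], dim=1),
        dim=1)[:, :-1]
    return trans * alpha

def composite(weights, features):
    """Weighted sum along the ray, e.g. colors (n_rays, n_samples, 3) -> (n_rays, 3)."""
    return torch.sum(weights * features, dim=1)
```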
We can then use these weights to render features (i.e. the colors) and the depth map by taking the weighted sum over the samples along each ray. Here are the rendered examples for our box SDFVolume:
| Rendered Box | Depth Visualization |
|---|---|
| ![]() | ![]() |
2 - Optimizing a Basic Implicit Volume¶
In this section, we use our differentiable volume renderer (from the previous part) to optimize the parameters of our basic box implicit volume. Some of the details I implemented in this section (a training-step sketch follows this list):
- Sampling a subset of rays from the full image for each iteration (since volume rendering every single ray and all of its sample points can be memory intensive)
- Using an MSE loss (between the rendered colors and the ground-truth RGB values from the given views) as our objective
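A rough sketch of one training iteration under these details, reusing the ray helpers sketched in section 1.3; `model`, `renderer`, and the exact call signatures are placeholders, not the actual starter-code API:

```python
import torch

def train_step(model, renderer, cameras, images, optimizer, n_rays=1024):
    """One optimization step: render a random subset of rays and minimize MSE."""
    cam_idx = torch.randint(len(cameras), (1,)).item()
    gt_image = images[cam_idx]                       # (H, W, 3) ground-truth RGB
    H, W, _ = gt_image.shape
    xy_grid = get_pixels_from_image((W, H), device=gt_image.device)

    # Random subset of pixels/rays for this iteration (keeps memory bounded)
    sel = torch.randperm(H * W, device=gt_image.device)[:n_rays]
    origins, directions = get_rays_from_pixels(xy_grid[sel], cameras[cam_idx])

    rendered = renderer(model, origins, directions)  # (n_rays, 3) predicted colors
    loss = torch.nn.functional.mse_loss(rendered, gt_image.reshape(-1, 3)[sel])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```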
After optimizing for 1000 epochs, we get the following volume rendering, with the resulting (rounded) box center and side lengths:
Box center:
(0.25, 0.25, -0.00)
Box side lengths:
(2.01, 1.50, 1.50)
3 - Optimizing a Neural Radiance Field (NeRF)¶
Now, we finally build a NeuralRadianceField (NeRF) MLP, which will predict a volume density and a color for every single sampled input 3D point (which was obtained via our Ray sampling set-up). These densities and color values will then be rendered by our previous Volume Renderer and we can get a 3D reconstruction of our desired object.
This pipeline essentially allows us to optimize a rendered 3D volume from a given set of RGB images.
I designed my MLP pipeline with the following structure (a minimal sketch follows this list):
- The input points are first embedded with a Positional Encoding (following a Harmonic encoding pattern), which allows for richer input information
- The input goes through alternating `Linear` and `ReLU` layers with occasional skip connections (where the embedded points are fed back in), encouraging the model to learn fine geometric detail and preventing vanishing gradients
- A final `Linear + ReLU` is used to predict the density (since densities can't be negative) and a `Linear + Sigmoid` is used to predict the color (so that it is bounded)
- Note that as of right now, the MLP has no view dependence and only makes predictions based on 3D points (this is added in Part 4)
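A minimal sketch of this architecture, assuming the points have already been harmonically embedded; layer counts, hidden sizes, and names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class NeRFMLP(nn.Module):
    """Density + color MLP over embedded 3D points, with skip connections."""
    def __init__(self, embed_dim, hidden=128, n_layers=6, skip_at=(3,)):
        super().__init__()
        self.skip_at = set(skip_at)
        layers, in_dim = [], embed_dim
        for i in range(n_layers):
            # At skip layers, the embedded input is concatenated back in
            extra = embed_dim if i in self.skip_at else 0
            layers.append(nn.Linear(in_dim + extra, hidden))
            in_dim = hidden
        self.layers = nn.ModuleList(layers)
        self.density_head = nn.Sequential(nn.Linear(hidden, 1), nn.ReLU())   # sigma >= 0
        self.color_head = nn.Sequential(nn.Linear(hidden, 3), nn.Sigmoid())  # rgb in [0, 1]

    def forward(self, embedded_points):
        h = embedded_points
        for i, layer in enumerate(self.layers):
            if i in self.skip_at:
                h = torch.cat([h, embedded_points], dim=-1)
            h = torch.relu(layer(h))
        return self.density_head(h), self.color_head(h)
```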
After 250 epochs of training with 6 hidden layers (`Linear + ReLU`), 1 skip connection (at the 4th layer), and the aforementioned `Linear + ReLU` / `Linear + Sigmoid` output layers, we get the following optimized volume rendering of a Lego Truck (along with some of the views that were used as input RGBs):
| Example View 1 | Example View 2 | Example View 3 |
|---|---|---|
| ![]() | ![]() | ![]() |
4 - NeRF Extra (View Dependence)¶
As mentioned in the previous section, the initial NeRF MLP implementation did not take into account the camera views and relied solely on the 3D positions of the points. In this version, we add view dependence as well and visualize the results.
The MLP was changed in the following ways (a sketch of the view-dependent color branch follows this list):
- The positionally encoded input 3D points still pass through the MLP from before, and then a `Linear + ReLU` is used to output density (which should not be view dependent)
- The feature vector (the output of the above MLP) is then concatenated with the positionally encoded directions (same Harmonic embeddings as before) and passed through a separate MLP (I used 3 `Linear + ReLU` layers here)
- This 2nd MLP does not use skip connections (empirically, they did not improve the outputs)
- The output from this 2nd MLP is then passed through a `Linear + Sigmoid` to predict the color values
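A minimal sketch of the added view-dependent color branch (dimensions, layer counts, and names are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ViewDependentColorHead(nn.Module):
    """Predicts RGB from the geometry feature vector plus embedded view directions."""
    def __init__(self, feat_dim, dir_embed_dim, hidden=128, n_layers=3):
        super().__init__()
        dims = [feat_dim + dir_embed_dim] + [hidden] * n_layers
        self.mlp = nn.Sequential(*[
            layer
            for i in range(n_layers)
            for layer in (nn.Linear(dims[i], dims[i + 1]), nn.ReLU())
        ])
        self.out = nn.Sequential(nn.Linear(hidden, 3), nn.Sigmoid())  # bounded RGB

    def forward(self, features, embedded_dirs):
        # Density stays view-independent; only the color branch sees the view direction
        h = torch.cat([features, embedded_dirs], dim=-1)
        return self.out(self.mlp(h))
```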
We can see the comparisons of optimized volumes from the view and non-view dependent models below:
| Non-View Dependent NeRF | View Dependent NeRF |
|---|---|
| ![]() | ![]() |
| ![]() | ![]() |
Trade-offs:
- As we can see in both examples above, the view-dependent NeRF gives a result that looks more realistic and less "flat" / more "punchy" in its colors
- This is especially visible in the 2nd row where the View Dependent model can capture the glossy/reflective look of the knobs more accurately, whereas the Non-View dependent looks more dull (Because the colors are only based on geometry)
- The downside of the view dependent model is that it is more prone to overfitting and will have less generalization ability (i.e. the generalization to novel views will be less stable)
- This is because the model may memorize view-specific colors rather than truly learning the reflection properties; i.e. It may memorize that the view from Camera 1 looks brighter than the view from Camera 2, and this will lead to lower generalization quality for novel views
- This downside of the View-dependent model is not immediately obvious in our example outputs above because of the low-resolution setting and the ample views we train with, but this will be more evident for fewer input views / more detailed renders.
Part B - Neural Surface Rendering¶
5 - Sphere Tracing¶
Now, we move onto Neural Surface Rendering as opposed to Volume Rendering, with the core idea being to eventually learn a Signed Distance Function (SDF) that outputs a 0 for 3D points that are on the surface of the desired object.
In this part, I implemented the Sphere Tracing idea which does the following:
- It takes in some implicit function and rays (with origins and directions). Points are initialized as the origin of the rays
- We evaluate the value of our current points using the implicit function (i.e. get the SDF values, which indicate whether we're on the surface or not. A 0 value indicates intersecting with the surface)
- The points are moved in the direction of the rays by the value of the SDF (this guarantees we won't cross the surface, since the SDF value is the distance to the nearest surface point, which is at most the distance to the surface along the ray)
- Maintain a mask of which rays have intersected the surface (within some threshold `1e-5`)
- Iteratively repeat the above steps (i.e. calculating SDF values and walking along the ray) and terminate when all the rays have intersected with the surface or we reach some max-iterations threshold (a minimal sketch of this loop follows the list)
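A minimal sketch of this loop, assuming `implicit_fn` returns a signed distance per point; names and shapes are illustrative:

```python
import torch

def sphere_trace(implicit_fn, origins, directions, max_iters=64, eps=1e-5):
    """March points along each ray by the SDF value until (near-)intersection."""
    points = origins.clone()
    mask = torch.zeros(origins.shape[0], dtype=torch.bool, device=origins.device)
    for _ in range(max_iters):
        dist = implicit_fn(points).squeeze(-1)   # signed distance at the current points
        mask = mask | (dist < eps)               # rays that have hit the surface
        if mask.all():
            break
        # Step the remaining rays forward by their SDF value (never overshoots the surface)
        step = torch.where(mask, torch.zeros_like(dist), dist)
        points = points + step[:, None] * directions
    return points, mask
```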
We can see the resultant points from this Sphere Tracing algorithm for the implicit function of a Torus:
6 - Optimizing a Neural SDF¶
Similar to Part (3) where we built a NeRF MLP, we now build a Neural SDF MLP. This MLP should ideally take in a 3D point and output a scalar value which is its predicted distance to the surface (0 for points lying on the surface). We optimize it by training the network to output zero for the observed point cloud points (which lie on the surface).
The implemented MLP is quite similar to the NeRF one but with a few key differences explained below.
Details of the MLP:
- Similar set-up of positionally encoded input points (using Harmonic embeddings) and layers of `Linear + ReLU` with support for occasional skip connections
- A `Linear` output layer which maps the feature vector to a single scalar distance prediction (unlike our density prediction in the NeRF, we do not use a ReLU since the distance to the surface can be negative if the point is inside)
- Note: as of right now, this network only predicts distance, not color
The objective is to minimize the mean normalized distance for the observed points (since they should be 0) together with the Eikonal loss. To encourage our MLP to actually learn an SDF, the norm of its gradient is pushed towards 1. This is a property of a true SDF: its gradient points outward from the surface of the object, and its magnitude (i.e. norm) indicates the rate of change of distance, which equals 1 for an actual distance field.
To encourage this property, we add the following Eikonal Loss:
$$ \mathcal{L}_{\text{Eikonal}} = \mathbb{E}_{\mathbf{x}} \left[ \left( \| \nabla_{\mathbf{x}} f(\mathbf{x}) \|_2 - 1 \right)^2 \right] $$
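A minimal sketch of how this loss can be computed with autograd, assuming `sdf_model` maps `(n_points, 3)` points to scalar distances (names are illustrative):

```python
import torch

def eikonal_loss(sdf_model, points):
    """Penalize deviation of the SDF gradient norm from 1 at the given points."""
    points = points.clone().requires_grad_(True)
    distances = sdf_model(points)
    grads = torch.autograd.grad(
        outputs=distances, inputs=points,
        grad_outputs=torch.ones_like(distances),
        create_graph=True)[0]                    # (n_points, 3)
    return ((grads.norm(dim=-1) - 1.0) ** 2).mean()
```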
Putting all this together, we can optimize a Neural SDF for a given input, as seen below (left is the input points, right is our optimized rendering). For this example, I used 6 hidden layers in my network (`Linear + ReLU`), no skip connections (they were not needed empirically), a hidden dimension of 128, and an output `Linear` layer to predict distance.
| Input Points (Point Cloud) | Optimized Rendering |
|---|---|
| ![]() | ![]() |
7 - VolSDF¶
We now extend our Neural SDF MLP from the previous part in the following ways:
The MLP is extended to predict color as well as distance. To do so, I made the following changes:
- Add the positional encoding to the input points and pass them through the previous distance MLP (`Linear + ReLU` layers with skip connections)
- Take this feature vector and use it both for the distance prediction (via a separate `Linear` output layer) and as input to a color-prediction MLP
- The color-prediction MLP also has `Linear + ReLU` layers with skip connections
- The final color output is made by passing through a `Linear + Sigmoid` (so that we can bound the values)
By following the above format, the color and distance MLPs are not independent: the color MLP benefits from receiving the feature vector already learned by the distance MLP, and the distance MLP is optimized (during back-prop) to encode features that are also useful for color prediction. This should have the following benefits:
- Color network is more aware of geometry and will be less prone to assigning inconsistent colors (i.e. different colors for the same surface point when seen from a different view)
- The distance network might learn finer surface details that are important not only visually but also for predicting the geometry of the surface
- The color network doesn't have to have as many layers since it's not starting from just the 3D points
We are also now using our Volume Renderer (from Part A) so we have to take our predicted distance values (SDF Values) and convert them to densities that will be used in our transmittance/radiance weight calculations.
To convert SDF to Volume Density, we use the formula from the VolSDF paper which is as follows:
$$ \sigma(\mathbf{x}) = \alpha \, \Psi_\beta \left( -d_\Omega(\mathbf{x}) \right), $$
$\Psi_\beta$ is the Cumulative Distribution Function of a Laplace distribution:
$$ \Psi_\beta(s) = \begin{cases} \frac{1}{2} \exp\left(\frac{s}{\beta}\right), & \text{if } s \le 0, \\ 1 - \frac{1}{2} \exp\left(-\frac{s}{\beta}\right), & \text{if } s > 0. \end{cases} $$
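A minimal sketch of this conversion, translating the two formulas above directly (the branchless form below is equivalent to the piecewise Laplace CDF):

```python
import torch

def sdf_to_density(signed_distance, alpha=10.0, beta=0.05):
    """VolSDF-style density: sigma(x) = alpha * Psi_beta(-d(x)) with a Laplace CDF."""
    s = -signed_distance
    # Psi_beta(s) = 0.5 * exp(s / beta)       for s <= 0
    #             = 1 - 0.5 * exp(-s / beta)  for s > 0
    psi = 0.5 * torch.exp(-s.abs() / beta)
    psi = torch.where(s <= 0, psi, 1.0 - psi)
    return alpha * psi
```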
The following is the impact of $\alpha$ and $\beta$:
$\alpha$ --> It is a multiplicative term that controls the scale of the density values returned from $\Psi_\beta$. If $\alpha$ is larger, then the resultant density will be greater and the visual rendering will be more opaque. If $\alpha$ is smaller, then the density values will be scaled to be lower and in our rendering, we get surfaces that are more "translucent" or softer (You get more emitted radiance from things inside the object)
$\beta$ --> $\beta$ can be thought of as the radius of the "blur" around the zero-level set of the SDF. It is inversely related to the sharpness of the transition in the Laplace CDF. Intuitively, as $\beta$ gets smaller (and tends to $0$), the exponential term becomes larger (since $\beta$ is in the denominator), and the density values change more rapidly. This gives a very sharp boundary where there is a distinct difference between density outside the surface and inside. As $\beta$ increases, we get a slower exponential decay and a smoother, thicker boundary as density increases gradually as the point approaches the surface.
1. How does high $\beta$ bias your learned SDF? What about low $\beta$?
A high $\beta$ will bias the learned model towards learning more broad, smoothed distance predictions. The rendering loss will also give a weaker signal because small changes in the learned SDF don't have a strong impact on the rendering quality (since density is more spread out). Therefore, the learned SDF will be smoother and less like a true SDF function, and result in blurrier outputs.
With a low $\beta$, the learned SDF is biased towards fitting the surface more tightly, because if it does not fit the true zero-level set well, the rendered densities will shift rapidly. It will get large gradients near the surface, but almost none anywhere else. Therefore, the learned SDF is biased towards a sharper surface and more closely matching a true SDF, but can be harder to optimize (as explained below).
2. Would an SDF be easier to train with volume rendering and low $\beta$ or high $\beta$? Why?
An SDF with volume rendering will be easier to train with high $\beta$. This is because the rendered rays will get gradient contributions from many of the sampled points (since the density transition is smoother and many points have non-zero density values), leading to a smoother gradient surface and easier optimization (even if it is slow). With low $\beta$, only the points near the surface contribute significant gradients, resulting in sparse and unstable gradient values. This can make the model overfit or find it difficult to optimize (getting stuck in areas with near-zero gradient values).
3. Would you be more likely to learn an accurate surface with high $\beta$ or low $\beta$? Why?
You are more likely to learn an accurate surface with low $\beta$, as this encourages a sharper drop-off in densities away from the zero-level set, producing a more precisely defined surface. The boundary becomes more "crisp" rather than "blurry" which can lead to better geometric and visual accuracy (assuming the optimization converges well).
OVERALL: We see that $\beta$ has a trade-off where high $\beta$ leads to smoother optimization but less well-defined boundaries, and low $\beta$ gives better-defined boundaries but makes optimization more difficult.
Below is the result of our VolSDF rendering, for some different values of Alpha and Beta:
For the Beta variations, we can see the predicted effect: the colored rendering looks sharper (more desirable here) and less "fuzzy" for 0.03 than for 0.05 and 0.07, but, as can be seen in the geometry visualization, the edges (especially around the base) are more jagged because of the sharp drop-off in density.
| Alpha 10, Beta 0.05 | Alpha 10, Beta 0.03 | Alpha 10, Beta 0.07 |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
For the Alpha variations, we can see the predicted effect where the Alpha = 1 rendering looks extremely cloudy and translucent, allowing you to see through to the other side of the object even though it should be occluded. This is because the density values have been scaled to be quite low. Alpha = 10 dramatically improves on this by scaling density to be higher and Alpha = 20 makes this even more opaque (with a stronger geometry as well).
| Alpha 10, Beta 0.05 | Alpha 1, Beta 0.05 | Alpha 20, Beta 0.05 |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
Based on the above experiments, the final parameters I chose were Alpha = 20, Beta = 0.03. This seems to work best because we get a rendering that is not fuzzy (due to the higher density scaling and sharper drop-off) while still being optimizable. The geometry also fits the original object best.
| Full Rendering | Geometry |
|---|---|
| ![]() | ![]() |
8.1 - Neural Surface Extras (Render a Large Scene with Sphere Tracing)¶
In Part 5, we rendered a single torus, but the idea of Sphere Tracing can be extended to render more complicated scenes with multiple primitives. The points keep walking along the ray directions until they intersect the closest surface. We also use the following equation for the composed SDF of a union of primitives:
$$ f_{\text{union}}(\mathbf{x}) = \min_{i} \; f_i(\mathbf{x}) $$
Using this code for the SDF of a Pyramid and the union idea from above, I defined a new SDF class for a scene composed of 20 primitive pyramids (a rough sketch of this class is shown below). The heights and centers of the pyramids are varied so that some are closer to each other and some are slightly taller/shorter than others.
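A rough sketch of how such a composed scene SDF might look; the `primitive_sdf` callable (and its assumed `(points, height)` signature) stands in for the linked pyramid SDF, and the centers/heights are placeholders:

```python
import torch

class PyramidSceneSDF:
    """Scene SDF: minimum over translated/scaled copies of a primitive SDF."""
    def __init__(self, primitive_sdf, centers, heights):
        self.primitive_sdf = primitive_sdf   # e.g. the linked pyramid SDF (signature assumed)
        self.centers = centers               # (n_primitives, 3) tensor of pyramid centers
        self.heights = heights               # (n_primitives,) tensor of pyramid heights

    def __call__(self, points):
        # Evaluate every primitive at every query point, then take the union (min)
        dists = torch.stack(
            [self.primitive_sdf(points - c, h)
             for c, h in zip(self.centers, self.heights)], dim=-1)
        return dists.min(dim=-1).values
```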
Then, using the same Sphere Tracing code from Part 5, we get the following output rendering:
8.2 - Neural Surface Extras (Fewer Training Views)¶
So far, for both NeRF and VolSDF, we have been using 100 input training views for a single scene. Theoretically, the surface representation should be able to infer the surface from fewer views (especially compared to NeRF) because:
- The Eikonal constraint will more strongly regularize the geometry whereas our NeRF system does not have this same explicit regularization
- The VolSDF system explicitly models the surface (via the learned SDF) whereas NeRF models the volume. This means that NeRF needs many views to pin down the surface location and produces geometrically noisy/blurry representations when it has few training views (and can't hallucinate unseen regions), whereas VolSDF's surface parametrization can do better even in this setting.
To check this, I reduced the number of training views used when optimizing the systems, and we can see the resultant outputs from NeRF and VolSDF below.
Note: I used the best Alpha/Beta parameters found in the previous section for VolSDF
| Number of Train Views | VolSDF | NeRF |
|---|---|---|
| 100 (Original) | ![]() | ![]() |
| 20 | ![]() | ![]() |
| 10 | ![]() | ![]() |
In accordance with our predictions, we see that:
- For 100 Views, both have very similar performance (with NeRF even looking sharper and more opaque due to its explicit density predictions)
- For 20 Views, VolSDF begins to lose some finer details (The red lights on the truck for example) but still looks better than NeRF which is more fuzzy, especially around the body of the truck
- For 10 Views, VolSDF is still able to optimize for a recognizable rendering. Many of the finer details around the wheels and the back of the truck have been lost. However, NeRF completely collapses and cannot hallucinate any of the unseen regions, resulting in an output that is just noise
8.3 - Neural Surface Extras (Alternate SDF to Density Conversions)¶
The SDF-to-Density conversion described in Part 7 is only the implementation from VolSDF; other methods exist as well. In this part, I switched this conversion to the following one, suggested by the NeuS paper:
$$\phi_s(x) = \frac{s e^{-s x}}{(1 + e^{-s x})^2}$$
where $x$ is our SDF output and $s$ is a parameter that controls the sharpness of the drop around the surface (larger $s$, steeper drop). This formula has some interesting implications (a sketch of the conversion follows this list):
- The sharp drop around the surface is in both directions (unlike VolSDF), so that even deep inside the object, the density is low. This means things inside don't have/emit color, and we simply have a thin shell around the surface
- If we visualize this density profile as a Gaussian-like bump (centered at the surface), the issue is that the rendered ray will not stop in the middle of the bump (as desired) but a bit earlier, because of the exponential decay in the transmittance.
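A minimal sketch of this alternate conversion, used as a drop-in replacement for the VolSDF conversion above; note that treating $\phi_s$ directly as the density is the simple ("naive") variant of the NeuS formulation:

```python
import torch

def neus_sdf_to_density(signed_distance, s=64.0):
    """Logistic-density conversion: phi_s(x) = s * e^{-s x} / (1 + e^{-s x})^2."""
    # Equivalent, numerically stable form: s * sigmoid(s*x) * (1 - sigmoid(s*x))
    sig = torch.sigmoid(s * signed_distance)
    return s * sig * (1.0 - sig)
```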
Using this alternate SDF to Density conversion, we get the following rendering for the geometry and the geometry + color (with the side by side comparison with the VolSDF conversion method):
Note that s = 64.0 in these renderings.
| VolSDF Conversion | NeuS Conversion |
|---|---|
| ![]() | ![]() |
| ![]() | ![]() |
We can see that, because of the sharp density drop-off, the rendered geometry using the NeuS method is extremely noisy, as anything not on the surface (or extremely close to it) has density close to 0. The rendered color output is still reasonable, as we get a sharp surface, but this is likely due to this being a low-resolution setting with ample views, where the model is still able to optimize and render fairly well.





























