


Rounded box center: (0.25, 0.25, -0.00)
Rounded box side lengths: (2.01, 1.50, 1.50)
The code renders a spiral sequence of the optimized volume in images/part_2.gif. Compare this gif to the one below, and attach it in your write-up:


There are some materials (metal, for example) whose specular highlights change with viewing direction. Adding view dependence helps the network model these effects, improving photorealism. However, doing so can hurt the generalization of the network: the model can overfit and memorize per-view appearance that does not hold up under novel views.
View Dependent: Material Low Res

View Dependent: Material High Res

In hierarchical sampling, there are two passes of sampling: coarse and fine. First, during coarse sampling, we sample uniformly along each ray and use the result to compute per-sample weights w. Based on these weights, we resample points along each ray, sampling more densely near regions close to the surface. Hierarchical sampling therefore produces much sharper structures, since the samples concentrate around the surface. It also spends fewer samples on empty regions, so learning is more efficient. Within 150 epochs, as we see in the first image, we get a pretty good representation; without hierarchical sampling it takes many more iterations to reach similar sharpness. There is a slight overhead from training two networks, but the quality is better. A sketch of the fine-pass resampling is shown after the figure below.
Left to right: Hierarchical Sampling (Epoch 125), Hierarchical Sampling (Epoch 250), Without Hierarchical Sampling
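As a rough illustration, here is a minimal sketch of the fine-pass resampling (inverse-CDF sampling of the coarse weights). The names and shapes are assumptions rather than the starter code's exact interface: `z_vals` holds coarse bin edges of shape (n_rays, n_coarse + 1) and `weights` holds the per-bin weights w of shape (n_rays, n_coarse).

```python
import torch

def sample_pdf(z_vals, weights, n_fine, eps=1e-5):
    """Resample n_fine depths per ray from the piecewise-constant PDF defined
    by the coarse weights, concentrating samples in high-weight bins."""
    weights = weights + eps                                          # avoid division by zero
    pdf = weights / weights.sum(dim=-1, keepdim=True)                # (n_rays, n_coarse)
    cdf = torch.cumsum(pdf, dim=-1)
    cdf = torch.cat([torch.zeros_like(cdf[..., :1]), cdf], dim=-1)   # (n_rays, n_coarse + 1)

    # Invert the CDF with uniform samples: bins with large weight occupy a
    # larger share of the CDF, so they receive more of the fine samples.
    u = torch.rand(cdf.shape[0], n_fine, device=cdf.device)
    idx = torch.searchsorted(cdf, u, right=True).clamp(1, cdf.shape[-1] - 1)

    cdf_lo, cdf_hi = cdf.gather(-1, idx - 1), cdf.gather(-1, idx)
    z_lo, z_hi = z_vals.gather(-1, idx - 1), z_vals.gather(-1, idx)
    t = (u - cdf_lo) / (cdf_hi - cdf_lo + eps)                       # position inside the bin
    return z_lo + t * (z_hi - z_lo)                                  # (n_rays, n_fine) fine depths
```

The fine depths returned here would be concatenated with the coarse depths and sorted before being passed to the fine network.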

For sphere tracing, we essentially advance along each ray, in its direction, by the signed distance at the current point. I start with the near distance. The procedure computes the new sample points from this distance, the origin, and the direction, then evaluates the SDF at each of these points. If a point is within a small distance threshold of the surface, I consider it to have hit the surface; this is the logic used to compute the returned mask. For all rays that have neither hit the surface nor exceeded the max distance, I advance their distances by the signed distance and compute new points from these updated distances. The process runs for a fixed number of iterations or until all rays have either hit the surface or exceeded the max distance. The final points are returned by adding each ray's marched distance to its origin along its direction.
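A minimal sketch of this loop, assuming an `sdf_fn` that maps (N, 3) points to signed distances and ray `origins`/`directions` of shape (N, 3); the threshold and iteration count below are illustrative, not the exact values I used.

```python
import torch

def sphere_trace(sdf_fn, origins, directions, near=0.0, far=10.0,
                 max_iters=64, surf_eps=1e-5):
    """March each ray forward by the SDF value until it hits the surface
    or leaves the [near, far] range. Returns final points and a hit mask."""
    t = torch.full(origins.shape[:1], near, device=origins.device)  # per-ray distance
    hit = torch.zeros_like(t, dtype=torch.bool)                     # converged mask

    for _ in range(max_iters):
        points = origins + t[:, None] * directions      # current sample on each ray
        dist = sdf_fn(points).squeeze(-1)                # signed distance at the samples

        hit = hit | (dist.abs() < surf_eps)              # close enough -> surface hit
        active = (~hit) & (t < far)                      # rays still marching
        if not active.any():
            break
        t = torch.where(active, t + dist, t)             # advance by the signed distance

    return origins + t[:, None] * directions, hit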

After this, you should be able to train a NeuralSurface representation by:
python -m surface_rendering_main --config-name=points_surface
MLP: I chose to implement the MLP as described in the paper. The input to the model is the harmonic embedding of the (x, y, z) coordinate and the harmonic embedding of the viewing direction, which is a normalized vector. The initial part of the network is a feed-forward network with 8 linear layers, each followed by ReLU. After that, a single linear layer produces two concatenated embeddings (one for predicting color and the other for predicting density). I had to manually set the bias of this layer at initialization, since the gradients were not flowing properly without it. For density, we take the first number of the embedding and run it through ReLU to get a single non-negative value per point. For color, we concatenate the remaining embedding with the viewing-direction embedding, pass it through linear layers, and finally apply a sigmoid to output RGB values.
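A condensed sketch of the architecture described above, assuming the harmonic embeddings are computed outside the module; the layer widths are illustrative and the bias initialization detail is omitted.

```python
import torch
import torch.nn as nn

class NeRFMLP(nn.Module):
    """8-layer backbone on the embedded position, then a density head (ReLU)
    and a view-conditioned color head (sigmoid). Widths are illustrative."""
    def __init__(self, xyz_dim, dir_dim, hidden=256):
        super().__init__()
        layers, in_dim = [], xyz_dim
        for _ in range(8):
            layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
            in_dim = hidden
        self.backbone = nn.Sequential(*layers)
        self.feature = nn.Linear(hidden, hidden + 1)   # 1 density value + color feature
        self.color = nn.Sequential(
            nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, xyz_embed, dir_embed):
        h = self.feature(self.backbone(xyz_embed))
        density = torch.relu(h[..., :1])                              # non-negative density
        rgb = self.color(torch.cat([h[..., 1:], dir_embed], dim=-1))  # view-dependent color
        return density, rgb
```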
Eikonal Loss: The eikonal loss helps bias the network toward a true SDF. The network, i.e. the SDF, is a function whose gradient gives the direction of fastest change. When we take a small step in the direction of the gradient, we want f to change by the same amount as the physical distance moved, which is only possible if the norm of the gradient is 1 everywhere. The eikonal loss enforces that constraint. If it is not enforced, the zero level set still defines the surface, but away from the surface the distance values are scaled, which can cause issues for approaches such as ray marching where we may over- or undershoot. To compute this loss, the gradients come from autograd: we compute the point-wise norm of the gradient and, during optimization, push the difference between this norm and 1 toward zero.
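A minimal sketch of how this loss can be computed with autograd, assuming an `sdf_fn` mapping (N, 3) points to signed distances:

```python
import torch

def eikonal_loss(sdf_fn, points):
    """Penalize deviation of the SDF gradient norm from 1 at the given points."""
    points = points.detach().requires_grad_(True)
    sdf = sdf_fn(points)
    # d(sdf)/d(points) via autograd; one gradient vector per point.
    grad, = torch.autograd.grad(
        outputs=sdf, inputs=points,
        grad_outputs=torch.ones_like(sdf), create_graph=True,
    )
    return ((grad.norm(dim=-1) - 1.0) ** 2).mean()
```

In practice the points are random samples in the volume (plus points near the surface), and this term is added to the rendering loss with a small weight.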


alpha and beta control how we compute density from the signed distance function, which is then used in the EA (emission-absorption) model to render. The density is directly proportional to alpha, so alpha can be thought of as a parameter controlling opacity: if alpha is high, so is the density, making the surface appear more opaque; if it is low, the density is low and the surface appears more translucent. beta, on the other hand, controls how the density falls off as we move away from the surface. With a higher beta, the density stays high a small distance away from the surface in both directions, giving a thicker shell; with a lower beta, the density falls off exponentially as we leave the surface, giving a thinner shell. Hence a high beta biases the learned SDF toward a thicker surface, and a low beta biases it toward a thinner one.

Training is much harder with a low beta. With a very low beta the surface is forced to be very fine and precise, which can be challenging: looking at the derivative of the density, the gradients drop off very quickly away from the surface, so most of space contributes near-zero gradients and the network struggles to learn anything. With a higher beta the geometry is coarser and smoother, which is easier to learn.

The surface is more accurate with a low beta, since with a high beta the transition from empty space to surface is gradual, giving fuzzy surfaces. However, as mentioned above, it can be hard for the network to learn with a low beta from the very beginning.
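For reference, here is a sketch of the Laplace-CDF conversion from the VolSDF paper that these parameters plug into (tensor shapes and numerical clamping are omitted):

```python
import torch

def sdf_to_density(signed_distance, alpha, beta):
    """VolSDF-style density: alpha scales the overall opacity, beta controls
    how sharply the density decays away from the zero level set (the surface)."""
    s = -signed_distance  # positive inside the surface
    return alpha * torch.where(
        s <= 0,
        0.5 * torch.exp(s / beta),          # outside: exponential falloff with scale beta
        1.0 - 0.5 * torch.exp(-s / beta),   # inside: saturates toward alpha
    )
```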
I mostly kept the default settings, with a low learning rate and a high number of pretraining steps. The low learning rate helps fine-tune the network slowly and steadily as it converts the initial sphere SDF into the SDF of the target object. The main change for me was adding 3 skip connections in the initial backbone of the network that predicts both the signed distance and the color. These help pass gradients back to the initial embedding, learning better representations and converging faster.
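A sketch of how such skip connections can be wired into the backbone; the layer indices and widths here are illustrative, not my exact configuration.

```python
import torch
import torch.nn as nn

class BackboneWithSkips(nn.Module):
    """MLP backbone that re-concatenates the input embedding at a few layers,
    so gradients reach the embedding more directly. Skip indices are illustrative."""
    def __init__(self, in_dim, hidden=256, n_layers=8, skips=(2, 4, 6)):
        super().__init__()
        self.skips = set(skips)
        self.layers = nn.ModuleList()
        for i in range(n_layers):
            d_in = in_dim if i == 0 else hidden
            if i in self.skips:
                d_in += in_dim                      # room for the concatenated embedding
            self.layers.append(nn.Linear(d_in, hidden))

    def forward(self, x_embed):
        h = x_embed
        for i, layer in enumerate(self.layers):
            if i in self.skips:
                h = torch.cat([h, x_embed], dim=-1)  # skip connection back to the input
            h = torch.relu(layer(h))
        return h
```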

Here is my complex scene:

I used 20 randomly sampled views to train NeRF and VolSDF. These views were fixed across both experiments. From left to right: VolSDF geometry, VolSDF output, NeRF output.

We can see that even with sparse views, VolSDF is able to learn coherent geometry. VolSDF is less glossy and NeRF has smoother shading; NeRF also has a slightly blurrier appearance, especially near the back. VolSDF is also better at geometry: thin structures such as the tracks are much crisper, whereas NeRF renders them blobbier.



Left images are from VolSDF and right images are from naive NeuS.
For VolSDF we see solid, smooth geometry with minimal holes. This may be because the Laplace CDF used to compute density produces continuous, smooth gradients. With naive NeuS we see finer edges in the geometry but a lot of missing detail; this may be because it concentrates density more tightly around the surface, and without proper tuning the optimization is more brittle. On the color front, VolSDF is consistent and slightly diffuse, while NeuS has higher-contrast edges that bring out fine textures. This again may be because the NeuS surface behaves like an actual thin surface, so there is less averaging around the true surface and hence less blurriness.
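For reference, a sketch of the naive NeuS-style conversion being compared here, following the "naive solution" discussed in the NeuS paper: the density is a logistic bell curve of the signed distance, with a sharpness parameter `s` (the name and calling convention are assumptions).

```python
import torch

def neus_naive_density(signed_distance, s):
    """Naive NeuS-style density: a logistic density peaked at the zero level set,
    so density concentrates more tightly around the surface as s grows."""
    sig = torch.sigmoid(s * signed_distance)
    # s * e^{-s d} / (1 + e^{-s d})^2 written in sigmoid form; symmetric in d.
    return s * sig * (1.0 - sig)
```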