Single View to 3D¶
Part 1 -- Exploring loss functions¶
1.1 - Fitting a Voxel Grid¶
For the loss function, I used BCEWithLogitsLoss, which makes sense for this task since the targets are 0 or 1 (we are predicting voxel occupancy). It also lets us pass in raw logits, which the loss function squashes into the [0, 1] range with a sigmoid before computing the binary cross-entropy.
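As a rough sketch of the fitting setup (function and variable names here are my own, not necessarily those in the starter code; the iteration count and learning rate are assumptions):

```python
import torch

# Minimal sketch: optimize raw logits so that sigmoid(logits) matches the
# target 0/1 occupancy grid.
def fit_voxel_grid(voxels_tgt, n_iters=2000, lr=1e-2):
    voxels_src = torch.randn(voxels_tgt.shape, requires_grad=True)
    criterion = torch.nn.BCEWithLogitsLoss()  # applies the sigmoid internally
    optimizer = torch.optim.Adam([voxels_src], lr=lr)
    for _ in range(n_iters):
        loss = criterion(voxels_src, voxels_tgt.float())  # targets are 0/1 occupancies
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return torch.sigmoid(voxels_src)  # occupancy probabilities in [0, 1]
```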
Below is the ground truth voxel grid and the optimized version:
| Ground Truth | Prediction |
|---|---|
| ![]() | ![]() |
1.2 - Fitting a Point Cloud¶
For the loss function, I used the Chamfer distance (with a normalization term so that each direction of the loss is a mean over points rather than a sum).
This encourages every point in one set to lie close to its nearest neighbor in the other set and vice versa, which is exactly what we want when optimizing a point cloud to match the ground truth.
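A minimal sketch of this mean-normalized Chamfer loss, using pytorch3d's `knn_points` (my actual implementation may differ slightly in its details):

```python
from pytorch3d.ops import knn_points

# Symmetric Chamfer loss: for each point, the squared distance to its nearest
# neighbor in the other cloud, averaged (rather than summed) over points.
def chamfer_loss(points_src, points_tgt):
    # points_*: (B, N, 3) point clouds
    d_src = knn_points(points_src, points_tgt, K=1).dists[..., 0]  # src -> tgt
    d_tgt = knn_points(points_tgt, points_src, K=1).dists[..., 0]  # tgt -> src
    return d_src.mean() + d_tgt.mean()
```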
Below is the ground truth point cloud and the optimized version (I rendered the prediction more zoomed out so you can see that some points still do not lie close to the chair):
| Ground Truth | Prediction |
|---|---|
| ![]() | ![]() |
1.3 - Fitting a Mesh¶
For the mesh optimization, the loss combines the Chamfer loss (from the previous section) with a smoothness loss. The smoothness loss is based on the uniform Laplacian smoothing objective, which encourages each vertex not to deviate too far from the average of its neighbors.
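A minimal sketch of the combined objective (here using pytorch3d's built-in `chamfer_distance` and Laplacian smoothing for brevity; the sampling count and the weights are assumptions):

```python
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes

# Chamfer term pulls sampled surface points towards the target; the uniform
# Laplacian term keeps each vertex near the average of its neighbors.
def mesh_fitting_loss(mesh_src, mesh_tgt, w_chamfer=1.0, w_smooth=0.1):
    pts_src = sample_points_from_meshes(mesh_src, num_samples=5000)
    pts_tgt = sample_points_from_meshes(mesh_tgt, num_samples=5000)
    loss_chamfer, _ = chamfer_distance(pts_src, pts_tgt)
    loss_smooth = mesh_laplacian_smoothing(mesh_src, method="uniform")
    return w_chamfer * loss_chamfer + w_smooth * loss_smooth
```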
Below is the ground truth mesh and the optimized version:
| Ground Truth | Prediction |
|---|---|
| ![]() | ![]() |
Part 2 -- Reconstructing 3D from Single View¶
NOTE: Because of AWS issues (I had difficulty getting on-demand EC2 instances approved), I had to train all of the following models on my local machine (CPU) and use pretrained ResNet18 image features for the encoder. The results are therefore not as strong as they could otherwise have been.
2.1 - Image to Voxel Grid¶
I tried several different decoder networks for the voxel task, all sharing a similar base of 3D transposed-convolution layers, batch normalization, and ReLU activations. Some of the options I tried:
- Switching between 3 and 4 (ConvTranspose3d, BatchNorm, ReLU) blocks before the final ConvTranspose3d output layer
  - The 3-block version upsampled from a $4^{3}$ volume grid to a $32^{3}$ grid; the 4-block version upsampled from $2^{3}$ to $32^{3}$
- Passing a value for the pos_weight parameter of the BCEWithLogitsLoss --> since most of the voxel grid is empty, this upweights the contribution of occupied voxels to the loss
- Changing the batch size (for the 4-block network) from 32 to 128 to see whether this improves the effect of Batch Normalization (more stable, better mean/variance estimates) -- however, this variant was only trained for 3,000 iterations
- Running for 8,000 vs. 40,000 iterations (to check whether the higher iteration count just causes overfitting)
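For reference, here is a minimal sketch of the 4-block variant; the channel widths and the initial fully-connected reshape are assumptions:

```python
import torch.nn as nn

# Sketch of the 4-block decoder: a 512-d latent is reshaped into a 2^3 volume
# and upsampled 2^3 -> 4^3 -> 8^3 -> 16^3 -> 32^3, followed by a final
# ConvTranspose3d that produces the occupancy logits.
class VoxelDecoder(nn.Module):
    def __init__(self, latent_dim=512):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 256 * 2 * 2 * 2)
        def up(c_in, c_out):  # (ConvTranspose3d, BatchNorm, ReLU) block, doubles resolution
            return nn.Sequential(
                nn.ConvTranspose3d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm3d(c_out),
                nn.ReLU(),
            )
        self.net = nn.Sequential(
            up(256, 128), up(128, 64), up(64, 32), up(32, 16),
            nn.ConvTranspose3d(16, 1, kernel_size=3, stride=1, padding=1),  # keep 32^3, 1 channel
        )

    def forward(self, z):
        x = self.fc(z).view(-1, 256, 2, 2, 2)
        return self.net(x)  # raw logits for BCEWithLogitsLoss
```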
Ultimately, none of these trials performed particularly well (the best Avg F1@0.05 was ~50), but the relatively best model was:
- 4 (ConvTranspose3d, BatchNorm, ReLU) blocks, batch size 32, no pos_weight parameter
- This model was trained for 8,000 iterations; its results are shown below
| Single View Image | Ground Truth | Prediction |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
2.2 - Image to Point Cloud¶
My decoder network was a 4-layer MLP that expanded the input 512-dimensional latent vector through hidden layers of 1024, 2048, and 4096 units, each followed by batch normalization, a ReLU activation, and dropout for regularization.
The final layer outputs n_points * 3 values, followed by a Tanh activation so that the predicted 3D coordinates are bounded.
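A minimal sketch of this decoder (the dropout probability is an assumption):

```python
import torch.nn as nn

# Sketch of the point-cloud decoder MLP described above.
class PointCloudDecoder(nn.Module):
    def __init__(self, latent_dim=512, n_points=1000, p_drop=0.2):
        super().__init__()
        def block(d_in, d_out):
            return nn.Sequential(
                nn.Linear(d_in, d_out),
                nn.BatchNorm1d(d_out),
                nn.ReLU(),
                nn.Dropout(p_drop),
            )
        self.net = nn.Sequential(
            block(latent_dim, 1024),
            block(1024, 2048),
            block(2048, 4096),
            nn.Linear(4096, n_points * 3),
            nn.Tanh(),  # bounded 3D coordinates
        )
        self.n_points = n_points

    def forward(self, z):
        return self.net(z).view(-1, self.n_points, 3)
```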
Hyperparameters were as follows:
- n_points = 1000
- batch_size = 32
- lr = 4e-4
I trained the model for 38,000 iterations and below are some of the results:
| Single View Image | Ground Truth | Prediction |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
2.3 - Image to Mesh¶
My decoder network was very similar to the point-cloud decoder: it again expands the 512-dimensional latent vector through hidden layers of 1024, 2048, and 4096 units with batch normalization, ReLU, and dropout in between, and a Tanh activation at the end.
The final layer produces n_vertices * 3 values that are used as per-vertex offsets applied to an icosphere mesh.
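A minimal sketch of how the predicted offsets could be applied to the template mesh (the icosphere level and the reuse of the PointCloudDecoder sketch from 2.2 are assumptions; my actual model used n_vertices = 1000, so the template here is only illustrative):

```python
import torch.nn as nn
from pytorch3d.utils import ico_sphere

# Sketch: predict per-vertex offsets with the same style of MLP as in 2.2 and
# deform a template icosphere. The number of predicted offsets must match the
# template's vertex count.
class MeshDecoder(nn.Module):
    def __init__(self, latent_dim=512, ico_level=4):
        super().__init__()
        self.base_mesh = ico_sphere(ico_level)
        n_vertices = self.base_mesh.verts_packed().shape[0]
        self.offset_net = PointCloudDecoder(latent_dim, n_points=n_vertices)  # MLP from the 2.2 sketch

    def forward(self, z):
        mesh = self.base_mesh.extend(z.shape[0]).to(z.device)
        offsets = self.offset_net(z)  # (B, n_vertices, 3) per-vertex offsets
        return mesh.offset_verts(offsets.reshape(-1, 3))
```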
Hyperparameters were as follows:
- n_vertices = 1000
- batch_size = 32
- lr = 4e-4
- chamfer_loss weight = 1.0
- smooth_loss weight = 0.1
I trained the model for 37,000 iterations and below are some of the results:
| Single View Image | Ground Truth | Prediction |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
2.4 - Quantitative comparisons¶
| Mesh | Point Cloud | Voxel |
|---|---|---|
| ![]() | ![]() | ![]() |
The Avg F1@0.05 between the 3 models is as follows:
- pointcloud_decoder --> 79.543
- mesh_decoder --> 72.771
- voxel_decoder --> 48.984
Firstly, all 3 models show a monotonic rise in F1 score as the threshold increases. The threshold controls the maximum allowed distance for a prediction and a ground-truth point to count as a match, so a smaller threshold (a stricter matching criterion) understandably yields lower F1 scores. The steep drop in F1 at lower thresholds indicates that none of the models match the ground truth in fine detail, though their outputs do capture the general 3D shape reasonably well.
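For reference, a minimal sketch of how precision, recall, and F1 at a distance threshold can be computed from nearest-neighbor distances (the assignment's own evaluation code may compute this differently):

```python
from pytorch3d.ops import knn_points

# F1 at a distance threshold between a predicted and a ground-truth point set.
def f1_at_threshold(points_pred, points_gt, threshold=0.05):
    # points_*: (1, N, 3); knn dists are squared, so take the square root
    d_pred_to_gt = knn_points(points_pred, points_gt, K=1).dists[..., 0].sqrt()
    d_gt_to_pred = knn_points(points_gt, points_pred, K=1).dists[..., 0].sqrt()
    precision = 100.0 * (d_pred_to_gt < threshold).float().mean()
    recall = 100.0 * (d_gt_to_pred < threshold).float().mean()
    return 2 * precision * recall / (precision + recall + 1e-8)
```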
The point-cloud decoder has the highest average F1 score at the higher threshold, indicating it performs the "best" of the 3 models. This is likely because a point cloud does not have any connectivity restrictions and it is a flexible representation. During optimization, the model can learn to place a point closer to the ground truth without having to worry about creating holes or maintaining any type of continuity.
The mesh decoder, which starts from an icosphere and deforms its vertices, cannot create holes or disjoint parts and therefore cannot model the ground truth as well (think of the bench example, where there are gaps between the planks). This is why it performs second best (an average F1@0.05 of ~73).
The voxel decoder performs the poorest by far, possibly for a few reasons. First, the output grid only has a resolution of $32^{3}$, which limits the expressiveness of the model (though increasing the resolution would make training slower and would hurt performance for the same number of iterations). Second, this model has to output 32,768 values, whereas the mesh and point-cloud decoders only output $3 \times 1000$ values, so learning is slower and the optimization is more difficult.
2.5 - Analyse effects of hyperparameter variations¶
Since many of the meshes we saw in Section 2.4 were spiky, I wanted to tune the w_smooth parameter and analyze how the results are impacted.
As a recap, the w_smooth parameter is the weight of the smoothness loss in the objective function (i.e. increasing w_smooth increases its importance relative to the Chamfer loss). The Laplacian smoothing loss measures how much each vertex deviates from the average of its neighbors and is low when vertices stay close to that average, which effectively flattens bumps and spikes.
I tried the following 3 w_smooth values:
- 0.03 --> Lowest (Penalizes spikiness the least)
- 0.1 --> Standard (What we trained with in Section 2.3)
- 0.3 --> Highest (Penalizes spikiness the most)
These were the Avg F1@0.05 Scores:
- w_smooth = 0.03 --> 71.968
- w_smooth = 0.1 --> 72.242
- w_smooth = 0.3 --> 69.782
The models for the 0.03 and 0.3 values were trained for 8,000 iterations each (all other parameters held constant), and I reused my model checkpoint from Section 2.3 for 0.1. Below are predictions from the 3 models for the same input images:
| Input Image | Ground Truth | (w_smooth = 0.03) | (w_smooth = 0.1) | (w_smooth = 0.3) |
|---|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() |
It can be a little difficult to spot the differences, especially since none of the 3 models capture the ground truth all that well, but some patterns do emerge:
When w_smooth = 0.03, the model can predict very spiky/uneven meshes and overfit to variations in the Chamfer loss. This is especially visible in the last 2 rows, where the w_smooth = 0.03 column has many minor spikes, even in relatively flat regions of the chair. The model is not prioritizing surface continuity, so we get a noisy representation.
When w_smooth = 0.3, the model produces slightly fewer spikes but ignores detail in the geometry, so the outputs no longer look much like the input. This is clearest in the 1st row, where the w_smooth = 0.3 model produces the largest triangles of the 3 but arguably looks the least like the sofa, since the high-frequency information has been smoothed away.
When w_smooth = 0.1, we get a balance: the outputs still look somewhat spiky, but the chair shape is more recognizable than with the other 2 models. In the last row, the w_smooth = 0.1 output has fewer spikes than the first model (especially around the legs) while maintaining a better chair shape than the third. Overall, this model best balances smoothness and geometric detail.
This analysis is also supported by the fact that the w_smooth = 0.1 model has the best average F1 score of the 3 (albeit not by a large margin).
Overall, the w_smooth parameter controls a trade-off between noise levels (i.e. "Amount of spikiness") and detail in the 3D geometry.
2.6 - Interpret your model¶
Since the model is trained to predict a 3D representation from a single input view, I wanted to explore how its predictions vary when given different views of the same object. The R2N2 dataset provides multiple views per object (one of which is randomly selected every time you index into it), but here I separately fed 8 views of a single object into the trained model. The resulting 3D reconstructions are fairly consistent across views, demonstrating that the model remains robust even when parts of the object are occluded. This suggests that it has learned strong shape priors, such as recognizing that a chair usually has four legs and a backrest, which allows it to infer the complete geometry from partial visual evidence.
- I also computed the average Euclidean distance between the centroids of the 8 predicted point clouds: 0.2505
- Doing something similar for the meshes, I computed the average pairwise distance between vertices across the 8 meshes: 0.0184

These numbers indicate how consistent the meshes/point clouds are across views (even if they are not globally optimal w.r.t. the ground truth).
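A minimal sketch of one way such consistency numbers can be computed, assuming the 8 predictions are stacked into tensors (my exact implementation may have differed):

```python
import torch

# pointclouds: (8, N, 3) predicted point clouds; mesh_verts: (8, V, 3) predicted
# mesh vertices, with vertex correspondence across views (same template mesh).
def centroid_consistency(pointclouds):
    centroids = pointclouds.mean(dim=1)        # (8, 3) per-view centroids
    dists = torch.cdist(centroids, centroids)  # (8, 8) pairwise distances
    off_diag = ~torch.eye(len(centroids), dtype=torch.bool)
    return dists[off_diag].mean()

def vertex_consistency(mesh_verts):
    n, total, count = mesh_verts.shape[0], 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += (mesh_verts[i] - mesh_verts[j]).norm(dim=-1).mean()
            count += 1
    return total / count
```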
Since the outputs of the voxel decoder were quite noisy and sub-optimal in general, I did not include it in this analysis.
| Input View | Predicted Point Cloud | Predicted Mesh |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
Part 3 - Exploring other architectures¶
3.2 - Parametric Network¶
For this section, I implemented a network that models a parametric function: it takes in sampled 2D points and outputs their corresponding 3D points. My network is based on some of the ideas from AtlasNet. The benefit of this formulation is that at inference time, instead of only being able to output a fixed number of points (like the point-cloud model), we can sample arbitrarily many points and pass them through the function.
During training, I sample N (= 1000) 2D points in the unit square, concatenate each with the latent representation of the input image (taken from ResNet), and feed this into my MLP, which predicts the 3D positions of the points. AtlasNet uses multiple MLPs, but given compute constraints I used only 1 MLP (5 Linear + ReLU layers, a final Linear layer, and Tanh for bounding).
For the loss, I use the Chamfer loss (similar to the point cloud model) and use sample_points_from_meshes to get the ground truth points.
At first, this model did not perform very well (likely because of the single MLP and its limited capacity), so to make it slightly more robust I used the idea we learned about in class of incorporating frequency-based positional encodings, adding them for each sampled point.
```python
import math
import torch

def positional_encoding(self, xy, num_freqs=6):
    # Append sin/cos features at frequencies 1, 2, 4, ..., 2^(num_freqs - 1)
    frequency_bands = [2 ** i for i in range(num_freqs)]
    enc = [xy]
    for f in frequency_bands:
        enc.append(torch.sin(f * math.pi * xy))
        enc.append(torch.cos(f * math.pi * xy))
    return torch.cat(enc, dim=-1)
```
Essentially, for every sampled $(x, y)$ point we pass in $[x, \sin(\pi x), \cos(\pi x), \sin(2\pi x), \cos(2\pi x), \ldots]$ (and the same for $y$) to the model, which lets it learn finer geometric details.
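Below is a minimal sketch of how such a parametric decoder might be wired together; the layer widths and the uniform sampling of the unit square are assumptions, not necessarily what my model used:

```python
import math
import torch
import torch.nn as nn

# AtlasNet-style parametric decoder sketch: sampled 2D points are positionally
# encoded, concatenated with the image latent, and mapped to 3D coordinates.
class ParametricDecoder(nn.Module):
    def __init__(self, latent_dim=512, num_freqs=6, hidden_dim=512):
        super().__init__()
        self.num_freqs = num_freqs
        pe_dim = 2 * (1 + 2 * num_freqs)  # xy plus sin/cos at each frequency
        layers, d_in = [], latent_dim + pe_dim
        for _ in range(5):
            layers += [nn.Linear(d_in, hidden_dim), nn.ReLU()]
            d_in = hidden_dim
        layers += [nn.Linear(hidden_dim, 3), nn.Tanh()]  # bounded 3D coordinates
        self.mlp = nn.Sequential(*layers)

    def forward(self, z, n_points=1000):
        B = z.shape[0]
        xy = torch.rand(B, n_points, 2, device=z.device)  # sample the unit square
        enc = [xy]
        for f in (2 ** i for i in range(self.num_freqs)):
            enc += [torch.sin(f * math.pi * xy), torch.cos(f * math.pi * xy)]
        enc = torch.cat(enc, dim=-1)                      # (B, n_points, pe_dim)
        z_exp = z.unsqueeze(1).expand(-1, n_points, -1)   # broadcast latent per point
        return self.mlp(torch.cat([z_exp, enc], dim=-1))  # (B, n_points, 3)
```

At inference time, n_points can be set arbitrarily high to sample a denser point cloud from the same learned function.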
This did slightly improve results, but since I was not able to train the model for very long (only 1,500 iterations with a batch size of 32), performance was still limited. Below are some of the outputs:
| Single View Image | Parametric Output |
|---|---|
| ![]() | ![]() |
| ![]() | ![]() |
| ![]() | ![]() |
| ![]() | ![]() |
| ![]() | ![]() |