Assignment 2
1. Exploring loss functions
2. Reconstructing 3D from single view
2.4. Quantitative comparisons (10 points)

Given the above plots, we see that the point cloud achieves the highest F1 score, followed by the mesh, with voxels scoring the lowest; this ordering is also clear visually in the results above. One reason for this is the loss used to train each type of 3D representation. The voxel model is trained with a binary occupancy loss, which only asks whether a voxel is filled at a given location; this is far less descriptive of the surface than the chamfer distance (plus the smoothness term for meshes) used by the other representations.
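To make that comparison concrete, here is a minimal sketch of the three training losses as I understand them, written with PyTorch and PyTorch3D helpers; the exact helpers (`knn_points`, `mesh_laplacian_smoothing`) and the smoothness weight `w_smooth` are assumptions, not the precise code used for these runs.

```python
import torch.nn.functional as F
from pytorch3d.ops import knn_points
from pytorch3d.loss import mesh_laplacian_smoothing

# Voxel loss: per-voxel binary cross-entropy on occupancy logits.
# It only asks "is this cell filled?", with no notion of distance to the surface.
def voxel_loss(pred_logits, gt_occupancy):
    return F.binary_cross_entropy_with_logits(pred_logits, gt_occupancy)

# Point-cloud loss: chamfer distance between (B, N, 3) point sets. Every predicted
# point is pulled toward the ground-truth surface and vice versa.
def chamfer_loss(pred_points, gt_points):
    d_pred_to_gt = knn_points(pred_points, gt_points, K=1).dists[..., 0]
    d_gt_to_pred = knn_points(gt_points, pred_points, K=1).dists[..., 0]
    return d_pred_to_gt.mean() + d_gt_to_pred.mean()

# Mesh loss: chamfer on points sampled from the predicted/GT meshes, plus a
# Laplacian smoothness regularizer on the predicted mesh.
def mesh_loss(pred_samples, gt_samples, pred_mesh, w_smooth=0.1):
    return chamfer_loss(pred_samples, gt_samples) + w_smooth * mesh_laplacian_smoothing(pred_mesh)
```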
2.5. Analyse effects of hyperparameter variations (10 points)
I analyzed the effect of varying n_points for the image-to-point-cloud model, with n_points in [100, 500, 1000, 5000, 10000]. The F1 metric used for these comparisons is sketched after the table.
| n_points | F1 Score | Image | Predicted Point Cloud | Ground Truth Point Cloud |
|---|---|---|---|---|
| 100 | 45.084 | ![]() | ![]() | ![]() |
| 500 | 71.762 | ![]() | ![]() | ![]() |
| 1000 | 77.087 | ![]() | ![]() | ![]() |
| 5000 | 83.268 | ![]() | ![]() | ![]() |
| 10000 | 83.548 | ![]() | ![]() | ![]() |
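For reference, the F1 numbers above compare the predicted and ground-truth point clouds at a fixed distance threshold. Below is a rough sketch of that metric, assuming PyTorch3D's `knn_points` and a hypothetical threshold of 0.05; the actual threshold used by the evaluation code may differ.

```python
from pytorch3d.ops import knn_points

def f1_score(pred_points, gt_points, thresh=0.05):
    """F1 (%) between two (1, N, 3) point clouds at a distance threshold."""
    # knn_points returns squared distances, so take the square root.
    d_pred_to_gt = knn_points(pred_points, gt_points, K=1).dists[..., 0].sqrt()
    d_gt_to_pred = knn_points(gt_points, pred_points, K=1).dists[..., 0].sqrt()
    precision = (d_pred_to_gt < thresh).float().mean()   # predicted points near GT
    recall = (d_gt_to_pred < thresh).float().mean()      # GT points near prediction
    return (100.0 * 2 * precision * recall / (precision + recall + 1e-8)).item()
```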
2.6. Interpret your model (15 points)
In order to better understand the image-to-voxel decoder, we visualize the output of each deconvolution layer as XY/YZ/XZ slices to show what the model tends to learn at each step.
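The slices were produced roughly as follows. This is a hypothetical sketch: `model`, `model.decoder`, and `image` are placeholders for the actual trained model and input, not the exact names in my code.

```python
import torch
import matplotlib.pyplot as plt

# Capture the output of every 3D deconvolution layer with forward hooks,
# then plot the middle XY / XZ / YZ slice of each captured volume.
activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach().cpu()
    return hook

for name, module in model.decoder.named_modules():   # placeholder attribute name
    if isinstance(module, torch.nn.ConvTranspose3d):
        module.register_forward_hook(save_activation(name))

with torch.no_grad():
    model(image)                                      # one forward pass fills `activations`

for name, act in activations.items():
    vol = act[0].mean(dim=0)                          # average over channels -> (D, H, W)
    mid = vol.shape[0] // 2
    slices = [("XY", vol[mid]), ("XZ", vol[:, mid]), ("YZ", vol[:, :, mid])]
    fig, axes = plt.subplots(1, 3, figsize=(9, 3))
    for ax, (plane, sl) in zip(axes, slices):
        ax.imshow(sl, cmap="viridis")
        ax.set_title(f"{name} {plane}")
        ax.axis("off")
    plt.show()
```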
Let us look at an example input image and the outputs at each layer.
Input image:

The first deconvolution layer produces a 4x4x4 grid; slices of its output can be seen below. This provides only a very coarse sense of where the object might be, showing that the first layer learns a very rough estimate.

The second deconv layer produces an 8x8x8 grid with a little more spatial fidelity, although the results look very noisy. You can roughly make out a region of interest, but there is no clean geometry yet.

The third deconv layer produces a 16x16x16 grid, and now you can start to see the shape of a chair. The edges are still noisy, but there is a sense of where the chair exists and where it doesn't.

The fourth layer produces a 32x32x32 grid, and the shape of the chair is now much clearer. Not only are the edges cleaner, but there is a sense of contouring within the chair, distinguishing the arms from the seat area. There are still small artifacts, but the chair is recognizable.

The last deconv layer also outputs a 32x32x32 grid, and the chair is now much clearer still. While the visualization below does not show it, we pass this output through a sigmoid before plotting, so each value can be read as the probability of a voxel being occupied rather than a raw logit (we keep logits during training for more stable optimization).

Output Voxels:

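Putting the walkthrough together, a decoder consistent with this 4³ → 8³ → 16³ → 32³ → 32³ progression could look like the sketch below. The layer count matches what is visualized above, but the channel widths, kernel sizes, and the input feature dimension are assumptions rather than the exact architecture used.

```python
import torch
import torch.nn as nn

class VoxelDecoder(nn.Module):
    """Hypothetical decoder matching the layer-by-layer visualization above."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.layers = nn.Sequential(
            nn.ConvTranspose3d(feat_dim, 256, kernel_size=4), nn.ReLU(),                  # 1^3  -> 4^3
            nn.ConvTranspose3d(256, 128, kernel_size=4, stride=2, padding=1), nn.ReLU(),  # 4^3  -> 8^3
            nn.ConvTranspose3d(128, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),   # 8^3  -> 16^3
            nn.ConvTranspose3d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),    # 16^3 -> 32^3
            nn.ConvTranspose3d(32, 1, kernel_size=3, padding=1),                          # 32^3 -> 32^3 occupancy logits
        )

    def forward(self, feat):
        x = feat[:, :, None, None, None]   # treat the image feature as a 1x1x1 volume
        return self.layers(x)              # raw logits; train with BCEWithLogitsLoss

# At visualization time only, map logits to occupancy probabilities:
# probs = torch.sigmoid(decoder(image_features))
```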
Other Examples:







Another Example:






