Assignment 2
1. Exploring loss functions
2. Reconstructing 3D from single view
2.4. Quantitative comparisons (10 points)

Given the above plots, we see that the point cloud achieves the highest F1 score, followed by the mesh, with voxels scoring the lowest; this ordering is also clear visually in the results above. One reason for this is the loss used to train each type of 3D representation. The voxel model is trained with a binary occupancy loss, which only asks whether a voxel is filled at a given location; this is far less descriptive of the surface than the chamfer distance (plus the smoothness term for meshes) used by the other representations.
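To make that comparison concrete, here is a minimal sketch of the three training losses as I understand them, written with PyTorch and PyTorch3D helpers; the exact helpers (`knn_points`, `mesh_laplacian_smoothing`) and the smoothness weight `w_smooth` are assumptions, not the precise code used for these runs.

```python
import torch.nn.functional as F
from pytorch3d.ops import knn_points
from pytorch3d.loss import mesh_laplacian_smoothing

# Voxel loss: per-voxel binary cross-entropy on occupancy logits.
# It only asks "is this cell filled?", with no notion of distance to the surface.
def voxel_loss(pred_logits, gt_occupancy):
    return F.binary_cross_entropy_with_logits(pred_logits, gt_occupancy)

# Point-cloud loss: chamfer distance between (B, N, 3) point sets. Every predicted
# point is pulled toward the ground-truth surface and vice versa.
def chamfer_loss(pred_points, gt_points):
    d_pred_to_gt = knn_points(pred_points, gt_points, K=1).dists[..., 0]
    d_gt_to_pred = knn_points(gt_points, pred_points, K=1).dists[..., 0]
    return d_pred_to_gt.mean() + d_gt_to_pred.mean()

# Mesh loss: chamfer on points sampled from the predicted/GT meshes, plus a
# Laplacian smoothness regularizer on the predicted mesh.
def mesh_loss(pred_samples, gt_samples, pred_mesh, w_smooth=0.1):
    return chamfer_loss(pred_samples, gt_samples) + w_smooth * mesh_laplacian_smoothing(pred_mesh)
```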
2.5. Analyse effects of hyperparameter variations (10 points)
I analyzed the effect of varying n_points for the image-to-point-cloud model, with n_points in [100, 500, 1000, 5000, 10000]. The F1 metric used for these comparisons is sketched after the table.
| n_points | F1 Score | Image | Predicted Point Cloud | Ground Truth Point Cloud |
|---|---|---|---|---|
| 100 | 45.084 | ![]() | ![]() | ![]() |
| 500 | 71.762 | ![]() | ![]() | ![]() |
| 1000 | 77.087 | ![]() | ![]() | ![]() |
| 5000 | 83.268 | ![]() | ![]() | ![]() |
| 10000 | 83.548 | ![]() | ![]() | ![]() |
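For reference, the F1 numbers above compare the predicted and ground-truth point clouds at a fixed distance threshold. Below is a rough sketch of that metric, assuming PyTorch3D's `knn_points` and a hypothetical threshold of 0.05; the actual threshold used by the evaluation code may differ.

```python
from pytorch3d.ops import knn_points

def f1_score(pred_points, gt_points, thresh=0.05):
    """F1 (%) between two (1, N, 3) point clouds at a distance threshold."""
    # knn_points returns squared distances, so take the square root.
    d_pred_to_gt = knn_points(pred_points, gt_points, K=1).dists[..., 0].sqrt()
    d_gt_to_pred = knn_points(gt_points, pred_points, K=1).dists[..., 0].sqrt()
    precision = (d_pred_to_gt < thresh).float().mean()   # predicted points near GT
    recall = (d_gt_to_pred < thresh).float().mean()      # GT points near prediction
    return (100.0 * 2 * precision * recall / (precision + recall + 1e-8)).item()
```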
2.6. Interpret your model (15 points)
In order to better understand the image-to-voxel decoder, we visualize the output of each deconvolution layer as XY/YZ/XZ slices to show what the model tends to learn at each step.
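The slices were produced roughly as follows. This is a hypothetical sketch: `model`, `model.decoder`, and `image` are placeholders for the actual trained model and input, not the exact names in my code.

```python
import torch
import matplotlib.pyplot as plt

# Capture the output of every 3D deconvolution layer with forward hooks,
# then plot the middle XY / XZ / YZ slice of each captured volume.
activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach().cpu()
    return hook

for name, module in model.decoder.named_modules():   # placeholder attribute name
    if isinstance(module, torch.nn.ConvTranspose3d):
        module.register_forward_hook(save_activation(name))

with torch.no_grad():
    model(image)                                      # one forward pass fills `activations`

for name, act in activations.items():
    vol = act[0].mean(dim=0)                          # average over channels -> (D, H, W)
    mid = vol.shape[0] // 2
    slices = [("XY", vol[mid]), ("XZ", vol[:, mid]), ("YZ", vol[:, :, mid])]
    fig, axes = plt.subplots(1, 3, figsize=(9, 3))
    for ax, (plane, sl) in zip(axes, slices):
        ax.imshow(sl, cmap="viridis")
        ax.set_title(f"{name} {plane}")
        ax.axis("off")
    plt.show()
```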
Let us look at an example input image and the outputs at each layer.
Input image:

The first deconvolution layer produces a 4x4x4 grid; slices of its output can be seen below. This provides only a very coarse sense of where the object might be, showing that the first layer learns a very rough estimate.

The second deconv layer produces an 8x8x8 grid with a little more spatial fidelity, although the results look very noisy. You can roughly make out a region of interest, but there is no clean geometry yet.

The third deconv layer produces a 16x16x16 grid, and now you can start to see the shape of a chair. The edges are still noisy, but there is a sense of where the chair exists and where it doesn't.

The fourth layer produces a 32x32x32 grid, and the shape of the chair is now much clearer. Not only are the edges cleaner, but there is a sense of contouring within the chair, distinguishing the arms from the seat area. There are still small artifacts, but the chair is recognizable.

The last deconv layer also outputs a 32x32x32 grid, and the chair is now much clearer still. While the visualization below does not show it, we pass this output through a sigmoid before plotting, so each value can be read as the probability of a voxel being occupied rather than a raw logit (we keep logits during training for more stable optimization).

Output Voxels:

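Putting the walkthrough together, a decoder consistent with this 4³ → 8³ → 16³ → 32³ → 32³ progression could look like the sketch below. The layer count matches what is visualized above, but the channel widths, kernel sizes, and the input feature dimension are assumptions rather than the exact architecture used.

```python
import torch
import torch.nn as nn

class VoxelDecoder(nn.Module):
    """Hypothetical decoder matching the layer-by-layer visualization above."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.layers = nn.Sequential(
            nn.ConvTranspose3d(feat_dim, 256, kernel_size=4), nn.ReLU(),                  # 1^3  -> 4^3
            nn.ConvTranspose3d(256, 128, kernel_size=4, stride=2, padding=1), nn.ReLU(),  # 4^3  -> 8^3
            nn.ConvTranspose3d(128, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),   # 8^3  -> 16^3
            nn.ConvTranspose3d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),    # 16^3 -> 32^3
            nn.ConvTranspose3d(32, 1, kernel_size=3, padding=1),                          # 32^3 -> 32^3 occupancy logits
        )

    def forward(self, feat):
        x = feat[:, :, None, None, None]   # treat the image feature as a 1x1x1 volume
        return self.layers(x)              # raw logits; train with BCEWithLogitsLoss

# At visualization time only, map logits to occupancy probabilities:
# probs = torch.sigmoid(decoder(image_features))
```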
Other Examples:







Another Example:






