Learning for 3D Vision: Assignment 2
- Name: Ahish Deshpande
- Andrew ID: ahishd
1. Exploring Loss Functions
1.1. Fitting a Voxel Grid
**Voxel Grid (Iterations = 2000)**

| Target Voxel Grid | Predicted Voxel Grid |
| :---: | :---: |
| *(image)* | *(image)* |
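Fitting the voxel grid amounts to treating each cell as a binary occupancy classification. Below is a minimal sketch of a binary cross-entropy loss of the kind used for this part, assuming the model outputs raw logits (the function and variable names are illustrative, not the exact assignment code):

```python
import torch
import torch.nn.functional as F

def voxel_loss(pred_logits: torch.Tensor, target_occupancy: torch.Tensor) -> torch.Tensor:
    # Binary cross-entropy between predicted occupancy logits and the
    # binary target grid; both tensors share shape (B, D, H, W).
    return F.binary_cross_entropy_with_logits(pred_logits, target_occupancy)
```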
1.2. Fitting a Point Cloud
**Point Cloud (Iterations = 6000)**

| Target Point Cloud | Predicted Point Cloud |
| :---: | :---: |
| *(image)* | *(image)* |
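The point cloud is fit by minimizing the symmetric Chamfer distance between the predicted and target clouds. A minimal sketch, assuming pytorch3d is available (names illustrative):

```python
import torch
from pytorch3d.ops import knn_points

def chamfer_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    # pred: (B, N, 3), gt: (B, M, 3). For each point, take the squared
    # distance to its nearest neighbour in the other cloud, then sum the
    # means of the two directions.
    d_pred_to_gt = knn_points(pred, gt, K=1).dists  # (B, N, 1)
    d_gt_to_pred = knn_points(gt, pred, K=1).dists  # (B, M, 1)
    return d_pred_to_gt.mean() + d_gt_to_pred.mean()
```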
1.3. Fitting a Mesh
**Mesh (Iterations = 6000)**

| Target Mesh | Predicted Mesh |
| :---: | :---: |
| *(image)* | *(image)* |
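For the mesh, the objective compares point samples from the two surfaces with the Chamfer distance and regularizes the prediction with Laplacian smoothing. A minimal sketch of such a combined objective, assuming pytorch3d (weights and names illustrative):

```python
import torch
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes

def mesh_loss(pred_mesh, gt_mesh, w_chamfer=1.0, w_smooth=0.1, n_samples=5000):
    # Sample point clouds from both meshes and compare them with the
    # Chamfer distance; the Laplacian term keeps the predicted surface
    # locally smooth.
    pred_pts = sample_points_from_meshes(pred_mesh, n_samples)
    gt_pts = sample_points_from_meshes(gt_mesh, n_samples)
    loss_chamfer, _ = chamfer_distance(pred_pts, gt_pts)
    loss_smooth = mesh_laplacian_smoothing(pred_mesh, method="uniform")
    return w_chamfer * loss_chamfer + w_smooth * loss_smooth
```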
2. Reconstructing 3D from a Single View
2.1. Image to Voxel Grid
**Voxels**

| RGB Image | Target Voxel Grid | Predicted Voxel Grid |
| :---: | :---: | :---: |
| *(image)* | *(image)* | *(image)* |
| *(image)* | *(image)* | *(image)* |
| *(image)* | *(image)* | *(image)* |
2.2. Image to Point Cloud
**Point Clouds**

| RGB Image | Target Point Cloud | Predicted Point Cloud |
| :---: | :---: | :---: |
| *(image)* | *(image)* | *(image)* |
| *(image)* | *(image)* | *(image)* |
| *(image)* | *(image)* | *(image)* |
2.3. Image to Mesh
**Mesh**

| RGB Image | Target Mesh | Predicted Mesh |
| :---: | :---: | :---: |
| *(image)* | *(image)* | *(image)* |
| *(image)* | *(image)* | *(image)* |
| *(image)* | *(image)* | *(image)* |
2.4. Quantitative Comparisons


2.5. Analyse Effects of Hyperparameter Variations
The hyperparameters I varied were the weight of the Chamfer distance loss and the weight of the Laplacian smoothing loss.
**Loss Weights vs. Output Mesh**

| Chamfer Distance Weight | Laplacian Smoothing Weight | Ground Truth Mesh | Predicted Mesh | Comments |
| :---: | :---: | :---: | :---: | --- |
| 1 | 0.1 | *(image)* | *(image)* | This configuration provides the best balance between smoothness and Chamfer distance. |
| 1 | 1 | *(image)* | *(image)* | This configuration places too much emphasis on smoothness: the triangles become noticeably elongated, and the mesh resembles the target object less closely. |
| 3 | 0.1 | *(image)* | *(image)* | With this configuration the mesh is much spikier, but it minimizes the Chamfer loss better by allowing triangles that change direction abruptly. |
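Concretely, the two weights enter the objective as a simple weighted sum of the Chamfer and Laplacian terms (the same form as the mesh loss sketched in section 1.3). A runnable sketch of the sweep over the three settings in the table, using icospheres as stand-ins for the actual prediction and ground truth:

```python
from pytorch3d.utils import ico_sphere
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes

pred_mesh = ico_sphere(level=2)  # stand-in for the predicted mesh
gt_mesh = ico_sphere(level=4)    # stand-in for the ground-truth mesh

# The three (Chamfer, Laplacian) weight pairs from the table above.
for w_chamfer, w_smooth in [(1.0, 0.1), (1.0, 1.0), (3.0, 0.1)]:
    pred_pts = sample_points_from_meshes(pred_mesh, 5000)
    gt_pts = sample_points_from_meshes(gt_mesh, 5000)
    loss_c, _ = chamfer_distance(pred_pts, gt_pts)
    loss_s = mesh_laplacian_smoothing(pred_mesh, method="uniform")
    total = w_chamfer * loss_c + w_smooth * loss_s
    print(f"w_chamfer={w_chamfer}, w_smooth={w_smooth}: loss={total.item():.4f}")
```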
2.6. Interpret Your Model
Point Cloud Model Interpretation
Encoder Weights
Looking at the weights in the encoder, we can see the filters the model uses to extract features from the input. While many of them are not easy to interpret by eye, several can be read as gradient filters in the horizontal, vertical, and diagonal directions, showing that the first layer extracts basic low-level structure from the input.
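A sketch of how these first-layer filters can be visualized, assuming a ResNet-style encoder whose first layer is named `conv1` (an assumed attribute name; adjust it to the actual encoder):

```python
import matplotlib.pyplot as plt

def show_first_layer_filters(encoder):
    # First conv weights, shape (out_ch, 3, k, k); normalize to [0, 1]
    # so each filter can be displayed as a small RGB image.
    w = encoder.conv1.weight.detach().cpu()  # `conv1` is an assumed name
    w = (w - w.min()) / (w.max() - w.min())
    fig, axes = plt.subplots(8, 8, figsize=(8, 8))
    for ax, filt in zip(axes.flat, w):
        ax.imshow(filt.permute(1, 2, 0).numpy())  # (k, k, 3) for imshow
        ax.axis("off")
    plt.show()
```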
Conv Layers


The convolution kernels shown above represent the different features the model extracts to build the latent representation that the decoder then consumes.
Decoder Layer
The decoder weights visualized above show how much importance the model has learned to assign to each feature of the latent representation. The x-axis indexes the 512 latent features and the y-axis indexes the output units, so each row shows how strongly one output unit draws on each element of the latent code.
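A sketch of how this heatmap can be produced, assuming the decoder begins with an `nn.Linear` layer taking the 512-dimensional latent code (the layer name is illustrative):

```python
import matplotlib.pyplot as plt

def show_decoder_weights(decoder_fc):
    # decoder_fc: the decoder's first fully connected layer, assumed to be
    # an nn.Linear with weight shape (out_features, 512). Each row of the
    # heatmap shows how strongly one output unit uses each latent feature.
    w = decoder_fc.weight.detach().cpu().numpy()
    plt.figure(figsize=(10, 4))
    plt.imshow(w, aspect="auto", cmap="coolwarm")
    plt.xlabel("latent feature (0-511)")
    plt.ylabel("output unit")
    plt.colorbar(label="learned weight")
    plt.show()
```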
Voxel Grid Model Interpretation
Conv Transpose Kernels
The 3D kernels above, taken from the transposed convolutions that upsample the latent representation, show what the voxel grid prediction model learns during training.
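A sketch of one way to render these 3D kernels, assuming access to an `nn.ConvTranspose3d` layer from the decoder (layer handle and grid size are illustrative):

```python
import matplotlib.pyplot as plt

def show_deconv_kernels(deconv):
    # deconv: an nn.ConvTranspose3d layer; its weight has shape
    # (in_ch, out_ch, kD, kH, kW). Show the central depth slice of each
    # kernel for the first input channel as a 2D grayscale image.
    w = deconv.weight.detach().cpu()[0]  # (out_ch, kD, kH, kW)
    mid = w.shape[1] // 2                # central depth slice index
    fig, axes = plt.subplots(4, 8, figsize=(12, 6))
    for ax, kernel in zip(axes.flat, w):
        ax.imshow(kernel[mid].numpy(), cmap="gray")
        ax.axis("off")
    plt.show()
```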
3. Exploring Other Architectures / Datasets
3.3. Extended Dataset for Training
The point cloud model was trained on the full dataset, and the following results were obtained.
Qualitative Results
**Point Cloud**

| Class | Target Point Cloud | Predicted Point Cloud |
| :---: | :---: | :---: |
| Plane | *(image)* | *(image)* |
| Car | *(image)* | *(image)* |
| Chair | *(image)* | *(image)* |
Quantitative Results
The F1 score for the model trained on the full dataset is much higher than that of the single-class model. The model also learns the different classes correctly, and is therefore likely to generalize better rather than overfitting to a single class.
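For context, the F1 score here combines precision and recall at a fixed distance threshold between predicted and ground-truth points. A minimal sketch, assuming pytorch3d (the threshold value is illustrative):

```python
import torch
from pytorch3d.ops import knn_points

def f1_score(pred: torch.Tensor, gt: torch.Tensor, threshold: float = 0.05):
    # pred: (1, N, 3), gt: (1, M, 3). Precision is the fraction of predicted
    # points within `threshold` of the ground truth; recall is the fraction
    # of ground-truth points within `threshold` of the prediction.
    d_pred = knn_points(pred, gt, K=1).dists.sqrt()
    d_gt = knn_points(gt, pred, K=1).dists.sqrt()
    precision = (d_pred < threshold).float().mean()
    recall = (d_gt < threshold).float().mean()
    return 2 * precision * recall / (precision + recall + 1e-8)
```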