Learning for 3D Vision: Assignment 2
- Name: Ahish Deshpande
- Andrew ID: ahishd
1. Exploring Loss Functions
1.1. Fitting a Voxel Grid
**Voxel Grid (Iterations = 2000)**

| Target Voxel Grid | Predicted Voxel Grid |
| :---: | :---: |
| *(image)* | *(image)* |
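Fitting the voxel grid amounts to treating each cell as a binary occupancy classification. Below is a minimal sketch of a binary cross-entropy loss of the kind used for this part, assuming the model outputs raw logits (the function and variable names are illustrative, not the exact assignment code):

```python
import torch
import torch.nn.functional as F

def voxel_loss(pred_logits: torch.Tensor, target_occupancy: torch.Tensor) -> torch.Tensor:
    # Binary cross-entropy between predicted occupancy logits and the
    # binary target grid; both tensors share shape (B, D, H, W).
    return F.binary_cross_entropy_with_logits(pred_logits, target_occupancy)
```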
1.2. Fitting a Point Cloud
**Point Cloud (Iterations = 6000)**

| Target Point Cloud | Predicted Point Cloud |
| :---: | :---: |
| *(image)* | *(image)* |
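The point cloud is fit by minimizing the symmetric Chamfer distance between the predicted and target clouds. A minimal sketch, assuming pytorch3d is available (names illustrative):

```python
import torch
from pytorch3d.ops import knn_points

def chamfer_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    # pred: (B, N, 3), gt: (B, M, 3). For each point, take the squared
    # distance to its nearest neighbour in the other cloud, then sum the
    # means of the two directions.
    d_pred_to_gt = knn_points(pred, gt, K=1).dists  # (B, N, 1)
    d_gt_to_pred = knn_points(gt, pred, K=1).dists  # (B, M, 1)
    return d_pred_to_gt.mean() + d_gt_to_pred.mean()
```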
1.3. Fitting a Mesh
**Mesh (Iterations = 6000)**

| Target Mesh | Predicted Mesh |
| :---: | :---: |
| *(image)* | *(image)* |
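For the mesh, the objective compares point samples from the two surfaces with the Chamfer distance and regularizes the prediction with Laplacian smoothing. A minimal sketch of such a combined objective, assuming pytorch3d (weights and names illustrative):

```python
import torch
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes

def mesh_loss(pred_mesh, gt_mesh, w_chamfer=1.0, w_smooth=0.1, n_samples=5000):
    # Sample point clouds from both meshes and compare them with the
    # Chamfer distance; the Laplacian term keeps the predicted surface
    # locally smooth.
    pred_pts = sample_points_from_meshes(pred_mesh, n_samples)
    gt_pts = sample_points_from_meshes(gt_mesh, n_samples)
    loss_chamfer, _ = chamfer_distance(pred_pts, gt_pts)
    loss_smooth = mesh_laplacian_smoothing(pred_mesh, method="uniform")
    return w_chamfer * loss_chamfer + w_smooth * loss_smooth
```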
2. Reconstructing 3D from a Single View
2.1. Image to Voxel Grid
**Voxels**

| RGB Image | Target Voxel Grid | Predicted Voxel Grid |
| :---: | :---: | :---: |
| *(image)* | *(image)* | *(image)* |
| *(image)* | *(image)* | *(image)* |
| *(image)* | *(image)* | *(image)* |
2.2. Image to Point Cloud
**Point Clouds**

| RGB Image | Target Point Cloud | Predicted Point Cloud |
| :---: | :---: | :---: |
| *(image)* | *(image)* | *(image)* |
| *(image)* | *(image)* | *(image)* |
| *(image)* | *(image)* | *(image)* |
2.3. Image to Mesh
**Mesh**

| RGB Image | Target Mesh | Predicted Mesh |
| :---: | :---: | :---: |
| *(image)* | *(image)* | *(image)* |
| *(image)* | *(image)* | *(image)* |
| *(image)* | *(image)* | *(image)* |
2.4. Quantitative Comparisons


2.5. Analyse Effects of Hyperparameter Variations
The hyperparameters I varied were the weight of the Chamfer distance loss and the weight of the Laplacian smoothing loss.
**Loss Weights vs. Output Mesh**

| Chamfer Distance Weight | Laplacian Smoothing Weight | Ground Truth Mesh | Predicted Mesh | Comments |
| :---: | :---: | :---: | :---: | --- |
| 1 | 0.1 | *(image)* | *(image)* | This configuration provides the best balance between smoothness and Chamfer distance. |
| 1 | 1 | *(image)* | *(image)* | This configuration places too much emphasis on smoothness: the triangles become noticeably elongated, and the mesh resembles the target object less closely. |
| 3 | 0.1 | *(image)* | *(image)* | With this configuration the mesh is much spikier, but it minimizes the Chamfer loss better by allowing triangles that change direction abruptly. |
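Concretely, the two weights enter the objective as a simple weighted sum of the Chamfer and Laplacian terms (the same form as the mesh loss sketched in section 1.3). A runnable sketch of the sweep over the three settings in the table, using icospheres as stand-ins for the actual prediction and ground truth:

```python
from pytorch3d.utils import ico_sphere
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes

pred_mesh = ico_sphere(level=2)  # stand-in for the predicted mesh
gt_mesh = ico_sphere(level=4)    # stand-in for the ground-truth mesh

# The three (Chamfer, Laplacian) weight pairs from the table above.
for w_chamfer, w_smooth in [(1.0, 0.1), (1.0, 1.0), (3.0, 0.1)]:
    pred_pts = sample_points_from_meshes(pred_mesh, 5000)
    gt_pts = sample_points_from_meshes(gt_mesh, 5000)
    loss_c, _ = chamfer_distance(pred_pts, gt_pts)
    loss_s = mesh_laplacian_smoothing(pred_mesh, method="uniform")
    total = w_chamfer * loss_c + w_smooth * loss_s
    print(f"w_chamfer={w_chamfer}, w_smooth={w_smooth}: loss={total.item():.4f}")
```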
2.6. Interpret Your Model
Point Cloud Model Interpretation
Encoder Weights
Looking at the weights in the encoder, we can see the filters the model uses to extract features from the input. While many of them are not easy to interpret by eye, several can be read as gradient filters in the horizontal, vertical, and diagonal directions, showing that the first layer extracts basic low-level structure from the input.
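A sketch of how these first-layer filters can be visualized, assuming a ResNet-style encoder whose first layer is named `conv1` (an assumed attribute name; adjust it to the actual encoder):

```python
import matplotlib.pyplot as plt

def show_first_layer_filters(encoder):
    # First conv weights, shape (out_ch, 3, k, k); normalize to [0, 1]
    # so each filter can be displayed as a small RGB image.
    w = encoder.conv1.weight.detach().cpu()  # `conv1` is an assumed name
    w = (w - w.min()) / (w.max() - w.min())
    fig, axes = plt.subplots(8, 8, figsize=(8, 8))
    for ax, filt in zip(axes.flat, w):
        ax.imshow(filt.permute(1, 2, 0).numpy())  # (k, k, 3) for imshow
        ax.axis("off")
    plt.show()
```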
Conv Layers


The convolution kernels shown above represent the different features the model extracts to build the latent representation that the decoder then consumes.
Decoder Layer
The decoder weights visualized above show how much importance the model has learned to assign to each feature of the latent representation. The x-axis indexes the 512 latent features and the y-axis indexes the output units, so each row shows how strongly one output unit draws on each element of the latent code.
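A sketch of how this heatmap can be produced, assuming the decoder begins with an `nn.Linear` layer taking the 512-dimensional latent code (the layer name is illustrative):

```python
import matplotlib.pyplot as plt

def show_decoder_weights(decoder_fc):
    # decoder_fc: the decoder's first fully connected layer, assumed to be
    # an nn.Linear with weight shape (out_features, 512). Each row of the
    # heatmap shows how strongly one output unit uses each latent feature.
    w = decoder_fc.weight.detach().cpu().numpy()
    plt.figure(figsize=(10, 4))
    plt.imshow(w, aspect="auto", cmap="coolwarm")
    plt.xlabel("latent feature (0-511)")
    plt.ylabel("output unit")
    plt.colorbar(label="learned weight")
    plt.show()
```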
Voxel Grid Model Interpretation
Conv Transpose Kernels
The 3D kernels above, taken from the transposed convolutions that upsample the latent representation, show what the voxel grid prediction model learns during training.
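A sketch of one way to render these 3D kernels, assuming access to an `nn.ConvTranspose3d` layer from the decoder (layer handle and grid size are illustrative):

```python
import matplotlib.pyplot as plt

def show_deconv_kernels(deconv):
    # deconv: an nn.ConvTranspose3d layer; its weight has shape
    # (in_ch, out_ch, kD, kH, kW). Show the central depth slice of each
    # kernel for the first input channel as a 2D grayscale image.
    w = deconv.weight.detach().cpu()[0]  # (out_ch, kD, kH, kW)
    mid = w.shape[1] // 2                # central depth slice index
    fig, axes = plt.subplots(4, 8, figsize=(12, 6))
    for ax, kernel in zip(axes.flat, w):
        ax.imshow(kernel[mid].numpy(), cmap="gray")
        ax.axis("off")
    plt.show()
```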
3. Exploring Other Architectures / Datasets
3.3. Extended Dataset for Training
The point cloud model was trained on the full dataset, and the following results were obtained.
Qualitative Results
**Point Cloud**

| Class | Target Point Cloud | Predicted Point Cloud |
| :---: | :---: | :---: |
| Plane | *(image)* | *(image)* |
| Car | *(image)* | *(image)* |
| Chair | *(image)* | *(image)* |
Quantitative Results
The F1 score for the model trained on the full dataset is much higher than that of the single-class model. The model also learns the different classes correctly, and is therefore likely to generalize better rather than overfitting to a single class.
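For context, the F1 score here combines precision and recall at a fixed distance threshold between predicted and ground-truth points. A minimal sketch, assuming pytorch3d (the threshold value is illustrative):

```python
import torch
from pytorch3d.ops import knn_points

def f1_score(pred: torch.Tensor, gt: torch.Tensor, threshold: float = 0.05):
    # pred: (1, N, 3), gt: (1, M, 3). Precision is the fraction of predicted
    # points within `threshold` of the ground truth; recall is the fraction
    # of ground-truth points within `threshold` of the prediction.
    d_pred = knn_points(pred, gt, K=1).dists.sqrt()
    d_gt = knn_points(gt, pred, K=1).dists.sqrt()
    precision = (d_pred < threshold).float().mean()
    recall = (d_gt < threshold).float().mean()
    return 2 * precision * recall / (precision + recall + 1e-8)
```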