| Ground truth voxel | Predicted voxel |
|---|---|
| ![]() | ![]() |
| Ground truth point cloud | Predicted point cloud |
|---|---|
| ![]() | ![]() |
| Ground truth mesh | Predicted mesh |
|---|---|
| ![]() | ![]() |
| Input RGB | Predicted voxel grid | Ground truth mesh |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| Input RGB | Predicted point cloud | Ground truth mesh |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| Input RGB | Predicted mesh | Ground truth mesh |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |

F1 scores of the 3 methods of 3D reconstruction.
These are the main trends:
The voxel loss is a binary cross-entropy (BCE) loss. Because of the imbalanced distribution of occupied and unoccupied voxels, a positive weight is often applied to the occupied class to balance the loss. This section explores the effect of this weight on the training process and the F1 score.
Range of positive weights tested: [1, 2, 4, 8, 10]
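As a reference for how this weighting can be wired up, here is a minimal PyTorch sketch using the `pos_weight` argument of `BCEWithLogitsLoss` (the tensor shapes and the 32³ resolution are assumptions, not the exact training code):

```python
import torch
import torch.nn as nn

# Hypothetical shapes: a batch of 32^3 occupancy grids.
logits = torch.randn(8, 32, 32, 32)                   # raw decoder outputs
targets = (torch.rand(8, 32, 32, 32) > 0.9).float()   # sparse ground-truth occupancy

# pos_weight > 1 up-weights the loss on occupied voxels, counteracting
# the class imbalance (most voxels are empty).
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(4.0))
loss = criterion(logits, targets)
```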


Due to the imbalanced distribution of occupied and unoccupied voxels, the model tends to be under-confident when predicting occupancy. When trained with no positive weight (weight = 1), the model actually fails to predict any non-empty voxel grid until thousands of iterations into training.
Increasing the positive weight makes the model more confident, so it starts predicting voxel grids earlier. The model first predicts a voxel grid after:
However, increasing the positive weight seems to hurt the F1 score, as seen in the plot. This is probably because the model becomes over-confident and overfits to the training data. We can also see that with a smaller positive weight, the model gains a bigger jump in F1 score once it first predicts a voxel grid. This could be because, in the early iterations, even though the model is not yet confident enough to produce a voxel grid, it is still learning useful features that help it predict better voxel grids later. A high positive weight probably makes the model harder to train. I suspect that in the long run all of these models would converge to similar F1 scores, but this experiment was capped at max_iters=9000.
The model actually produces garbage point clouds for some inputs. Visually, these inputs are chairs with rare shapes (the long tail of the distribution). There are 2 main possible points of failure in the system: the image encoder and the point cloud decoder. Here, I choose to investigate the image encoder by visualizing the image features of the test set with t-SNE, as sketched below.
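A minimal sketch of how such a plot can be produced with scikit-learn (the feature file name and dimensionality are assumptions; any per-image encoder features work):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Hypothetical input: one encoder feature vector per test image,
# e.g. an (N, 512) array collected from the image encoder.
features = np.load("test_image_features.npy")

# A large perplexity (50) emphasizes global structure over local detail.
embedding = TSNE(n_components=2, perplexity=50).fit_transform(features)

plt.scatter(embedding[:, 0], embedding[:, 1], s=5)
plt.title("t-SNE of image encoder features (test set)")
plt.show()
```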
Top 5 inputs yielding the worst F1 scores: 86, 293, 321, 477, 521
| Input RGB | Predicted point cloud | Ground truth mesh |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
Overall t-SNE plot of the image features of the test set:

There are 2 outlier clusters: cluster 1 in red and cluster 2 in blue. Cluster 1:

Cluster 2:

As expected, the worst-performing inputs fall in the outlier clusters:
The t-SNE was generated with perplexity 50, which should favor global structure and yield a single big cluster, since the whole dataset consists of chairs. However, there are still outlier clusters. This means the image encoder fails to extract chair-like features from these images, effectively not recognizing them as chairs, which leads to failure in the 3D reconstruction.
Reference: Mescheder et al., *Occupancy Networks: Learning 3D Reconstruction in Function Space*, CVPR 2019.
Note: I trained with pretrained image features.
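For intuition, here is a rough sketch of the implicit-decoder idea from the paper as I used it (a simplified reading, not the exact architecture): an MLP maps an image feature plus a query 3D point to an occupancy logit, and a mesh is extracted from the resulting field (e.g. with marching cubes).

```python
import torch
import torch.nn as nn

class OccupancyDecoder(nn.Module):
    """Hypothetical implicit decoder: (image feature, 3D point) -> occupancy logit."""
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),   # one occupancy logit per query point
        )

    def forward(self, feat, points):
        # feat: (B, feat_dim) image features; points: (B, N, 3) query coordinates
        feat = feat.unsqueeze(1).expand(-1, points.shape[1], -1)
        return self.mlp(torch.cat([feat, points], dim=-1)).squeeze(-1)
```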
| Predicted mesh | Ground truth mesh |
|---|---|
| ![]() | ![]() |
| ![]() | ![]() |
| ![]() | ![]() |
| ![]() | ![]() |
F1 scores:

I trained the voxel network on the extended dataset with positive weight = 2.



It turns out that the F1 scores of the extended and chair-only models are quite close to each other at smaller thresholds. At the 0.05 threshold, the chair-only model has a higher F1 than the extended model. I think this suggests that the chair-only model is still better at reconstructing chairs than the extended model. At lower thresholds, the coarse voxel resolution limits what the metric can tell us about reconstruction accuracy. At higher thresholds, some chair features become more visible and important, and this is where the chair-only model shows better performance. Even though the extended model was trained with much more data and many more iterations (31k compared to 9k), it still cannot outperform the chair-only model at reconstructing chairs.
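For reference, this is the F1-at-threshold metric as I understand it, computed between sampled point sets (the function below is my own sketch, not the evaluation code used for the plots):

```python
import torch

def f1_at_threshold(pred_pts, gt_pts, tau):
    """Sketch of F1 between point sets pred_pts (N, 3) and gt_pts (M, 3).

    Precision: fraction of predicted points within tau of some GT point.
    Recall:    fraction of GT points within tau of some predicted point.
    """
    dists = torch.cdist(pred_pts, gt_pts)              # (N, M) pairwise distances
    precision = (dists.min(dim=1).values < tau).float().mean()
    recall = (dists.min(dim=0).values < tau).float().mean()
    return 2 * precision * recall / (precision + recall + 1e-8)
```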
The example shown below probably best visualizes the difference in reconstruction quality between the 2 models.
| Input RGB | Extended prediction | Chair-only prediction | Ground truth mesh |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
We can clearly see that the extended model missed the round top of the chair, while the chair-only model correctly reconstructed it.