16-825 Assignment 2

Duc Doan

Q1. Exploring loss functions

1.1 Fitting a voxel grid

(Figure: ground truth voxel grid | predicted voxel grid)
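
The loss used here is binary cross-entropy on per-voxel occupancy (the same loss discussed in 2.5.). A minimal sketch, assuming occupancy logits and a binary target grid of shape (B, 32, 32, 32):

```python
import torch.nn.functional as F

# Minimal sketch of the voxel fitting loss: binary cross-entropy between
# predicted occupancy logits and the ground-truth occupancy grid.
def voxel_loss(pred_logits, gt_vox):
    # pred_logits, gt_vox: (B, 32, 32, 32); gt_vox holds 0/1 occupancies.
    return F.binary_cross_entropy_with_logits(pred_logits, gt_vox.float())
```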

1.2 Fitting a point cloud

(Figure: ground truth point cloud | predicted point cloud)
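
The usual objective for this fit is the symmetric chamfer loss. A minimal sketch, assuming unbatched (N, 3) and (M, 3) point tensors (not necessarily the exact assignment code):

```python
import torch

def chamfer_loss(pred, gt):
    # Symmetric chamfer: mean squared distance from each point to its
    # nearest neighbor in the other set. pred: (N, 3), gt: (M, 3).
    d = torch.cdist(pred, gt) ** 2          # (N, M) squared pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```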

1.3 Fitting a mesh

(Figure: ground truth mesh | predicted mesh)
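
Mesh fitting typically combines a chamfer term on points sampled from both surfaces with a smoothness regularizer. A sketch using PyTorch3D; the weight w_smooth is a hypothetical value, not necessarily the one used here:

```python
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes

# Sketch of a mesh-fitting objective: chamfer distance on sampled surface
# points plus a uniform Laplacian smoothness term on the predicted mesh.
def mesh_loss(pred_mesh, gt_mesh, n_points=5000, w_smooth=0.1):
    pred_pts = sample_points_from_meshes(pred_mesh, n_points)  # (B, n_points, 3)
    gt_pts = sample_points_from_meshes(gt_mesh, n_points)
    loss_chamfer, _ = chamfer_distance(pred_pts, gt_pts)
    return loss_chamfer + w_smooth * mesh_laplacian_smoothing(pred_mesh)
```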

Q2. Reconstructing 3D from single view

2.1. Image to voxel grid

(Figure: three examples, each showing input RGB | predicted voxel grid | ground truth mesh)

2.2. Image to point cloud

(Figure: three examples, each showing input RGB | predicted point cloud | ground truth mesh)

2.3. Image to mesh

(Figure: three examples, each showing input RGB | predicted mesh | ground truth mesh)

2.4. Quantitative comparisons

(Figure: F1 scores of the three 3D reconstruction methods)

These are the main trends:
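
For reference on the metric itself: F1 at a distance threshold is the harmonic mean of precision and recall of nearest-neighbor matches between points sampled from the two shapes. A minimal sketch (not the exact evaluation code):

```python
import torch

def f1_at_threshold(pred, gt, threshold=0.05):
    # pred: (N, 3), gt: (M, 3) points sampled from the two shapes.
    d = torch.cdist(pred, gt)                                  # (N, M) distances
    precision = (d.min(dim=1).values < threshold).float().mean()
    recall = (d.min(dim=0).values < threshold).float().mean()
    return (2 * precision * recall / (precision + recall + 1e-8)).item()
```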

2.5. Hyperparameter variation: positive weight for the voxel loss

The voxel loss is a binary cross-entropy loss. Because occupied voxels are far rarer than unoccupied ones, a positive weight is often applied to rebalance the loss. This section explores the effect of this weight on the training process and on the F1 score.

Range of positive weights: [1, 2, 4, 8, 10]
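
In PyTorch this corresponds to the pos_weight argument of BCEWithLogitsLoss; a minimal sketch of the weighted voxel loss:

```python
import torch

# pos_weight > 1 up-weights the (rare) occupied voxels in the BCE loss.
def voxel_loss(pred_logits, gt_vox, pos_weight=2.0):
    criterion = torch.nn.BCEWithLogitsLoss(
        pos_weight=torch.tensor(pos_weight, device=pred_logits.device)
    )
    return criterion(pred_logits, gt_vox.float())
```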

Effects on training loss

(Figure: training loss curves for each positive weight)

Effects on F1 score

(Figure: validation F1 curves for each positive weight)

Because occupied voxels are so rare, the model tends to be under-confident in predicting occupancy. When trained with no positive weighting (weight = 1), the model in fact fails to predict any occupied voxels until thousands of iterations into training.

Increasing the positive weight makes the model more confident, so it starts predicting non-empty voxel grids earlier. The model first predicts a voxel grid after:

However, increasing the positive weight seems to hurt the F1 score, as seen in the plot. This is probably because the model becomes overconfident and overfits to the training data. We can also see that with a smaller positive weight, the model gets a bigger jump in F1 score once it first predicts a non-empty voxel grid. This could be because, in the early iterations, even though the model is not confident enough to produce occupied voxels, it is still learning useful features that help it predict better voxel grids later. High positive weights probably make the model harder to train. I suspect that in the long run all of these models would converge to similar F1 scores, but this experiment was capped at max_iters=9000.

2.6. Interpreting the point cloud model

The model produces garbage point clouds for some inputs. Visually, these inputs are chairs with rare shapes (the long tail of the distribution). There are two main possible points of failure in the system: the image encoder and the point cloud decoder. Here, I choose to investigate the image encoder by visualizing its features on the test set with t-SNE.

Top 5 inputs yielding the worst F1 scores: 86, 293, 321, 477, 521

(Figure: five examples, each showing input RGB | predicted point cloud | ground truth mesh)

Full t-SNE plot of the test-set image features:

(Figure: t-SNE embedding of test-set image features)

There are two outlier clusters: cluster 1 in red and cluster 2 in blue. Cluster 1:

(Figure: cluster 1)

Cluster 2:

(Figure: cluster 2)

As expected, the worst-performing inputs fall within these outlier clusters.

The t-SNE was generated with perplexity 50, which emphasizes global structure and should therefore yield a single large cluster, since the whole dataset consists of chairs. However, outlier clusters remain. This suggests that the image encoder fails to extract chair features from these images, effectively not recognizing them as chairs, which leads to failure in the 3D reconstruction.
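
For completeness, a minimal sketch of how the embedding was produced with scikit-learn (the feature file name is hypothetical; features were collected from the encoder during evaluation):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

feats = np.load("test_image_features.npy")   # (N, D) encoder features (hypothetical file)
emb = TSNE(n_components=2, perplexity=50, random_state=0).fit_transform(feats)

plt.scatter(emb[:, 0], emb[:, 1], s=5)
plt.title("t-SNE of test-set image features (perplexity = 50)")
plt.show()
```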

Q3. Exploring implicit network and extended dataset

3.1. Implicit network

Reference: Mescheder et al., Occupancy Networks: Learning 3D Reconstruction in Function Space (CVPR 2019)

Note: I trained with pretrained image features.
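
Following the Occupancy Networks idea, the decoder is an MLP that maps a global image feature plus a 3D query point to an occupancy logit; occupancies are then queried on a dense grid and meshed (e.g., with marching cubes). A minimal sketch; layer sizes are illustrative, not my exact architecture:

```python
import torch
import torch.nn as nn

class ImplicitDecoder(nn.Module):
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),            # occupancy logit per query point
        )

    def forward(self, feat, points):
        # feat: (B, feat_dim) image feature; points: (B, N, 3) query points.
        feat = feat.unsqueeze(1).expand(-1, points.shape[1], -1)
        return self.mlp(torch.cat([feat, points], dim=-1)).squeeze(-1)
```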

(Figure: four examples, each showing predicted mesh | ground truth mesh)

F1 scores:

(Figure: F1 scores of the implicit network)

3.3. Extended dataset

I trained the voxel network on the extended dataset with a positive weight of 2.

Training loss:

(Figure: training loss)

Validation F1:

(Figure: validation F1)

F1 comparison on the chair test set:

(Figure: F1 comparison)

It turns out that the F1 scores of the extended and chair-only models are quite close for smaller thresholds. At the 0.05 threshold, however, the chair-only model has a higher F1 than the extended model. I think this suggests that the chair-only model is still better at reconstructing chairs. At lower thresholds, the coarse voxel resolution limits how much the metric can reveal about reconstruction accuracy; at higher thresholds, chair-specific features become more visible and important, and this is where the chair-only model shows better performance. Even though the extended model was trained on much more data and for far more iterations (31k vs. 9k), it still cannot outperform the chair-only model at reconstructing chairs.

Example output

The example shown below probably best illustrates the difference in reconstruction quality between the two models.

(Figure: input RGB | extended-model prediction | chair-only prediction | ground truth mesh)

We can clearly see that the extended model misses the round top of the chair, while the chair-only model reconstructs it correctly.