| Ground truth voxel | Predicted voxel |
|---|---|
| ![]() | ![]() |
| Ground truth point cloud | Predicted point cloud |
|---|---|
| ![]() | ![]() |
| Ground truth mesh | Predicted mesh |
|---|---|
| ![]() | ![]() |
| Input RGB | Predicted voxel grid | Ground truth mesh |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| Input RGB | Predicted point cloud | Ground truth mesh |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| Input RGB | Predicted mesh | Ground truth mesh |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |

F1 scores of the 3 methods of 3D reconstruction.
These are the main trends:
The voxel loss is a binary cross-entropy (BCE) loss. Because of the imbalanced distribution of occupied and unoccupied voxels, a positive weight is often applied to the occupied class to balance the loss. This section explores the effect of this weight on the training process and the F1 score.
Range of positive weights tested: [1, 2, 4, 8, 10]
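As a reference for how this weighting can be wired up, here is a minimal PyTorch sketch using the `pos_weight` argument of `BCEWithLogitsLoss` (the tensor shapes and the 32³ resolution are assumptions, not the exact training code):

```python
import torch
import torch.nn as nn

# Hypothetical shapes: a batch of 32^3 occupancy grids.
logits = torch.randn(8, 32, 32, 32)                   # raw decoder outputs
targets = (torch.rand(8, 32, 32, 32) > 0.9).float()   # sparse ground-truth occupancy

# pos_weight > 1 up-weights the loss on occupied voxels, counteracting
# the class imbalance (most voxels are empty).
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(4.0))
loss = criterion(logits, targets)
```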


Due to the imbalanced distribution of occupied and unoccupied voxels, the model tends to be under-confident when predicting occupancy. When trained with no positive weight (weight = 1), the model actually fails to predict any non-empty voxel grid until thousands of iterations into training.
Increasing the positive weight makes the model more confident, so it starts predicting voxel grids earlier. The model first predicts a voxel grid after:
However, increasing the positive weight seems to hurt the F1 score, as seen in the plot. This is probably because the model becomes over-confident and overfits to the training data. We can also see that with a smaller positive weight, the model gains a bigger jump in F1 score once it first predicts a voxel grid. This could be because, in the early iterations, even though the model is not yet confident enough to produce a voxel grid, it is still learning useful features that help it predict better voxel grids later. A high positive weight probably makes the model harder to train. I suspect that in the long run all of these models would converge to similar F1 scores, but this experiment was capped at max_iters=9000.
The model actually produces garbage point clouds for some inputs. Visually, these inputs are chairs with rare shapes (the long tail of the distribution). There are 2 main possible points of failure in the system: the image encoder and the point cloud decoder. Here, I choose to investigate the image encoder by visualizing the image features of the test set with t-SNE, as sketched below.
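A minimal sketch of how such a plot can be produced with scikit-learn (the feature file name and dimensionality are assumptions; any per-image encoder features work):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Hypothetical input: one encoder feature vector per test image,
# e.g. an (N, 512) array collected from the image encoder.
features = np.load("test_image_features.npy")

# A large perplexity (50) emphasizes global structure over local detail.
embedding = TSNE(n_components=2, perplexity=50).fit_transform(features)

plt.scatter(embedding[:, 0], embedding[:, 1], s=5)
plt.title("t-SNE of image encoder features (test set)")
plt.show()
```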
Top 5 inputs yielding the worst F1 scores: 86, 293, 321, 477, 521
| Input RGB | Predicted point cloud | Ground truth mesh |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
Overall t-SNE plot of the image features of the test set:

There are 2 outlier clusters: cluster 1 in red and cluster 2 in blue. Cluster 1:

Cluster 2:

As expected, the worst-performing inputs fall in the outlier clusters:
The t-SNE was generated with perplexity 50, which should favor global structure and yield a single big cluster, since the whole dataset consists of chairs. However, there are still outlier clusters. This means the image encoder fails to extract chair-like features from these images, effectively not recognizing them as chairs, which leads to failure in the 3D reconstruction.
Reference: Mescheder et al., *Occupancy Networks: Learning 3D Reconstruction in Function Space*, CVPR 2019.
Note: I trained with pretrained image features.
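For intuition, here is a rough sketch of the implicit-decoder idea from the paper as I used it (a simplified reading, not the exact architecture): an MLP maps an image feature plus a query 3D point to an occupancy logit, and a mesh is extracted from the resulting field (e.g. with marching cubes).

```python
import torch
import torch.nn as nn

class OccupancyDecoder(nn.Module):
    """Hypothetical implicit decoder: (image feature, 3D point) -> occupancy logit."""
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),   # one occupancy logit per query point
        )

    def forward(self, feat, points):
        # feat: (B, feat_dim) image features; points: (B, N, 3) query coordinates
        feat = feat.unsqueeze(1).expand(-1, points.shape[1], -1)
        return self.mlp(torch.cat([feat, points], dim=-1)).squeeze(-1)
```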
| Predicted mesh | Ground truth mesh |
|---|---|
| ![]() | ![]() |
| ![]() | ![]() |
| ![]() | ![]() |
| ![]() | ![]() |
F1 scores:

I trained the voxel network on the extended dataset with positive weight = 2.



It turns out that the F1 scores of the extended and chair-only models are quite close to each other at smaller thresholds. At the 0.05 threshold, the chair-only model has a higher F1 than the extended model. I think this suggests that the chair-only model is still better at reconstructing chairs than the extended model. At lower thresholds, the coarse voxel resolution limits what the metric can tell us about reconstruction accuracy. At higher thresholds, some chair features become more visible and important, and this is where the chair-only model shows better performance. Even though the extended model was trained with much more data and many more iterations (31k compared to 9k), it still cannot outperform the chair-only model at reconstructing chairs.
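For reference, this is the F1-at-threshold metric as I understand it, computed between sampled point sets (the function below is my own sketch, not the evaluation code used for the plots):

```python
import torch

def f1_at_threshold(pred_pts, gt_pts, tau):
    """Sketch of F1 between point sets pred_pts (N, 3) and gt_pts (M, 3).

    Precision: fraction of predicted points within tau of some GT point.
    Recall:    fraction of GT points within tau of some predicted point.
    """
    dists = torch.cdist(pred_pts, gt_pts)              # (N, M) pairwise distances
    precision = (dists.min(dim=1).values < tau).float().mean()
    recall = (dists.min(dim=0).values < tau).float().mean()
    return 2 * precision * recall / (precision + recall + 1e-8)
```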
The example shown below probably best visualizes the difference in reconstruction quality between the 2 models.
| Input RGB | Extended prediction | Chair-only prediction | Ground truth mesh |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
We can clearly see that the extended model missed the round top of the chair, while the chair-only model correctly reconstructed it.