Vaibhav Parekh | Fall 2025
F1 score vs Threshold
Across thresholds, the point-cloud model scores the highest. For point clouds, the prediction lives directly in the space the F1 metric operates on (nearest-neighbor distances between sampled points).
The predicted points are continuous and sub-voxel, with no quantization of space, so small geometric details are preserved and counted as matches even at tight thresholds.
In voxel grids, performance is capped by grid resolution: thin structures become blocky or disappear entirely, which hurts recall at strict thresholds. Increasing the resolution does help, but memory grows cubically with grid size, so it quickly becomes prohibitive.
Lastly, meshes can be very accurate in principle. However, with simple decoders and a fixed topology/initialization, surfaces can come out wavy or incomplete. If the topology doesn't match the object, points sampled from the predicted mesh miss the ground-truth surface, lowering the F1 score.
In short, F1 favors methods that place points precisely on surfaces, and point clouds align best with that criterion. Voxels are limited by discretization, while meshes can underperform without strong surface/topology supervision or more capable decoders.
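For reference, here is a minimal sketch of how F1 at a distance threshold can be computed from two sampled point sets. The helper name f1_at_threshold and the use of pytorch3d.ops.knn_points are illustrative choices of mine, not necessarily what the starter code does.

```python
import torch
from pytorch3d.ops import knn_points

def f1_at_threshold(pred_pts, gt_pts, threshold):
    """F1 between two point sets (1, N, 3) / (1, M, 3) at a distance threshold.

    Precision: fraction of predicted points within `threshold` of some GT point.
    Recall:    fraction of GT points within `threshold` of some predicted point.
    """
    # Nearest-neighbor distances in both directions (knn_points returns squared distances).
    d_pred_to_gt = knn_points(pred_pts, gt_pts, K=1).dists[..., 0].sqrt()  # (1, N)
    d_gt_to_pred = knn_points(gt_pts, pred_pts, K=1).dists[..., 0].sqrt()  # (1, M)

    precision = (d_pred_to_gt < threshold).float().mean()
    recall = (d_gt_to_pred < threshold).float().mean()
    return 2 * precision * recall / (precision + recall + 1e-8)
```

Since the voxel and mesh predictions are also converted to sampled surface points before scoring, all three representations end up being evaluated by this same point-to-point criterion.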
Comparison of w_smooth
I varied the mesh smoothness weight in the loss and measured its effect on the mesh model. Specifically, I tested w_smooth = 0.1, 1, 2.
As w_smooth increased, F1 tended to drop, while the appearance of the meshes changed visibly. Even a small smoothing weight (w_smooth > 0) produced cleaner, less noisy surfaces at the cost of some edge sharpness.
Pushing the smoothing weight too high degraded performance, likely because it washes out the fine geometric detail the model needs for accurate reconstruction.
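As a concrete reference, here is a minimal sketch of how the smoothness weight enters the mesh loss, assuming a Chamfer data term on sampled points plus PyTorch3D's Laplacian smoothing regularizer. The function name mesh_loss and the exact weighting scheme are illustrative, not the assignment's exact implementation.

```python
import torch
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes

def mesh_loss(pred_mesh, gt_pts, w_chamfer=1.0, w_smooth=0.1, n_samples=5000):
    """Chamfer fit to GT points plus a Laplacian smoothness regularizer.

    Larger w_smooth -> flatter, cleaner surfaces but blurred fine detail.
    """
    pred_pts = sample_points_from_meshes(pred_mesh, n_samples)           # (B, n_samples, 3)
    loss_chamfer, _ = chamfer_distance(pred_pts, gt_pts)                 # data term
    loss_smooth = mesh_laplacian_smoothing(pred_mesh, method="uniform")  # regularizer
    return w_chamfer * loss_chamfer + w_smooth * loss_smooth
```

The trade-off observed above falls out of this form directly: the smoothness term penalizes local curvature everywhere, so increasing w_smooth suppresses noise and sharp detail alike.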
I wanted to see which parts of the input image the model relies on when producing its single-view 3D prediction, both to sanity-check its behavior and to spot weaknesses. I computed Grad-CAM on the model's image encoder, using the mean occupancy probability of the predicted voxel volume as the target, and overlaid the resulting heatmap on the RGB image with OpenCV's Jet colormap. The maps emphasize structural boundaries (junctions between seat and back, armrests, and legs), while large flat regions stay cooler. Overall, the model leans on edge/structure cues to form its 3D prediction.
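Below is a minimal sketch of the Grad-CAM computation described above, assuming a PyTorch model whose forward pass returns voxel occupancy logits; model, target_layer, and the (1, 3, H, W) input layout are placeholders for the actual code, not its real interface.

```python
import cv2
import numpy as np
import torch

def gradcam_on_encoder(model, image, target_layer):
    """Grad-CAM heatmap for a single-view 3D model (assumed interface).

    Target scalar: mean occupancy probability of the predicted voxel grid.
    `target_layer` is a conv layer inside the image encoder.
    """
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

    voxels = model(image)                 # assumed: occupancy logits
    score = torch.sigmoid(voxels).mean()  # scalar target for Grad-CAM
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    weights = grads["a"].mean(dim=(2, 3), keepdim=True)                # GAP over spatial dims
    cam = torch.relu((weights * feats["a"]).sum(dim=1)).squeeze(0)     # (h, w)
    cam = (cam / (cam.max() + 1e-8)).detach().cpu().numpy()

    # Overlay on the RGB input with OpenCV's Jet colormap.
    cam = cv2.resize(cam, (image.shape[-1], image.shape[-2]))
    heat = cv2.applyColorMap(np.uint8(255 * cam), cv2.COLORMAP_JET)
    rgb = np.uint8(255 * image.squeeze(0).permute(1, 2, 0).detach().cpu().numpy())
    return cv2.addWeighted(cv2.cvtColor(rgb, cv2.COLOR_RGB2BGR), 0.5, heat, 0.5, 0)
```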
Also, as a side script, I made a cool-looking animation of the voxels taking shape as the loss decreases, using fit_data.py --type 'vox'.
Note that this is separate from the Grad-CAM analysis above.
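For context, a minimal sketch of how such an animation can be assembled: render the (thresholded) voxel grid every few optimization steps, collect the frames, and write them out as a GIF. The helper name save_fitting_animation and the use of imageio are illustrative choices, not the exact script.

```python
import imageio
import numpy as np

def save_fitting_animation(snapshots, out_path="voxel_fit.gif", fps=10):
    """Turn per-iteration voxel renders into a GIF.

    `snapshots` is a list of HxWx3 uint8 frames collected during optimization,
    e.g. one render of the thresholded voxel grid every k iterations.
    """
    frames = [np.asarray(f, dtype=np.uint8) for f in snapshots]
    imageio.mimsave(out_path, frames, fps=fps)
```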


Training on three classes produced F1 results similar to single-class training, but required substantially more compute and a longer training time. Qualitatively, the reconstructions looked comparable. In short, multi-class training may add robustness across categories, but single-class training is much more efficient; the choice comes down to whether that robustness is needed and how much compute is available for training and inference.