16-825: Learning for 3D Vision - Assignment 2

Vaibhav Parekh | Fall 2025

Problem 1-1: Fitting a voxel grid

Fig. 1-1(a): Predicted
Fig. 1-1(b): Ground Truth

Problem 1-2: Fitting a point cloud

Fig. 1-2(a): Predicted
Fig. 1-2(b): Ground Truth

Problem 1-3: Fitting a mesh

Fig. 1-3(a): Predicted
Fig. 1-3(b): Ground Truth

Problem 2-1: Image to voxel grid

Fig. 2-1(1-a): Input RGB
Fig. 2-1(1-b): Predicted
Fig. 2-1(1-c): Ground Truth
Fig. 2-1(2-a): Input RGB
Fig. 2-1(2-b): Predicted
Fig. 2-1(2-c): Ground Truth
Fig. 2-1(3-a): Input RGB
Fig. 2-1(3-b): Predicted
Fig. 2-1(3-c): Ground Truth

Problem 2-2: Image to point cloud

Fig. 2-2(1-a): Input RGB
Fig. 2-2(1-b): Predicted
Fig. 2-2(1-c): Ground Truth
Fig. 2-2(2-a): Input RGB
Fig. 2-2(2-b): Predicted
Fig. 2-2(2-c): Ground Truth
Fig. 2-2(3-a): Input RGB
Fig. 2-2(3-b): Predicted
Fig. 2-2(3-c): Ground Truth

Problem 2-3: Image to mesh

Fig. 2-3(1-a): Input RGB
Fig. 2-3(1-b): Predicted
Fig. 2-3(1-c): Ground Truth
Fig. 2-3(2-a): Input RGB
Fig. 2-3(2-b): Predicted
Fig. 2-3(2-c): Ground Truth
Fig. 2-3(3-a): Input RGB
Fig. 2-3(3-b): Predicted
Fig. 2-3(3-c): Ground Truth

Problem 2-4: Quantitative comparisons

F1 score vs Threshold

Fig. 2-4(a): Voxel grid
Fig. 2-4(b): Point cloud
Fig. 2-4(c): Mesh

Across thresholds, the point-cloud model scores the highest. For point clouds, the prediction lives directly in the space the F1 metric operates on (nearest-neighbor distances between sampled points). Predicted points are continuous and sub-voxel, and they don’t quantize space, so small geometric details are preserved and counted as matches even at tight thresholds.
Voxel grids, by contrast, are capped by grid resolution: thin structures get “blocky” or disappear entirely, which hurts recall at strict thresholds. Increasing the resolution helps, but memory grows cubically with it.
Meshes can be very accurate in principle, but with a simple decoder and a fixed topology/initialization, surfaces can come out wavy or incomplete. When the topology doesn’t match the object, points sampled from the predicted surface miss the ground-truth surface, lowering the F1 score.
In short, F1 favors methods that place points precisely on surfaces. Point clouds align best with that criterion; voxels are limited by discretization, and meshes can underperform without stronger surface/topology supervision or a more expressive decoder.
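To make the metric concrete, here is a minimal sketch of F1 at a single threshold, assuming pred_pts and gt_pts are (N, 3) and (M, 3) tensors of points sampled from the prediction and the ground truth; the assignment's actual evaluation code may differ in details such as sampling density or distance units.

    import torch

    def f1_at_threshold(pred_pts, gt_pts, thresh):
        """F1 between an (N, 3) predicted and an (M, 3) GT point set."""
        d = torch.cdist(pred_pts, gt_pts)  # (N, M) pairwise distances
        precision = (d.min(dim=1).values < thresh).float().mean()  # predicted points near GT
        recall = (d.min(dim=0).values < thresh).float().mean()     # GT points near prediction
        return 2 * precision * recall / (precision + recall + 1e-8)

Sweeping thresh over a range of values produces the F1-vs-threshold curves plotted above.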

Problem 2-5: Analysing the effects of hyperparameter variations

Comparison of w_smooth

Fig. 2-5(1-a): Input RGB
Fig. 2-5(1-b): Predicted (w_smooth = default)
Fig. 2-5(1-c): Predicted (w_smooth = 1.0)
Fig. 2-5(1-d): Predicted (w_smooth = 2.0)
Fig. 2-5(1-e): Ground Truth
Fig. 2-5(2-a): Input RGB
Fig. 2-5(2-b): Predicted (w_smooth = default)
Fig. 2-5(2-c): Predicted (w_smooth = 1.0)
Fig. 2-5(2-d): Predicted (w_smooth = 2.0)
Fig. 2-5(2-e): Ground Truth
Fig. 2-5(3-a): Input RGB
Fig. 2-5(3-b): Predicted (w_smooth = default)
Fig. 2-5(3-c): Predicted (w_smooth = 1.0)
Fig. 2-5(3-d): Predicted (w_smooth = 2.0)
Fig. 2-5(3-e): Ground Truth

I varied the mesh smoothness weight w_smooth in the loss and measured its effect on the mesh model. Specifically, I tested w_smooth = 0.1 (the default), 1.0, and 2.0.
As w_smooth increased, F1 tended to drop, while the appearance of the meshes changed visibly: even modest smoothing (w_smooth > 0) produced cleaner, less noisy surfaces at the cost of some edge sharpness.
Pushing the smoothing weight too high degraded performance, likely because it washes out the fine geometric details the model needs for accurate reconstruction.
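For reference, a minimal sketch of the loss being varied, assuming the prediction is a PyTorch3D Meshes object and the ground truth is a (B, M, 3) batch of points; the actual loss in my training code may be structured slightly differently.

    import torch
    from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing
    from pytorch3d.ops import sample_points_from_meshes

    def mesh_loss(pred_mesh, gt_points, w_smooth=0.1, n_samples=5000):
        """Chamfer fit plus Laplacian smoothing, traded off by w_smooth."""
        pred_points = sample_points_from_meshes(pred_mesh, num_samples=n_samples)
        loss_chamfer, _ = chamfer_distance(pred_points, gt_points)            # geometric fit
        loss_smooth = mesh_laplacian_smoothing(pred_mesh, method="uniform")   # surface regularity
        return loss_chamfer + w_smooth * loss_smooth

The trade-off in the results above follows directly: a larger w_smooth shifts the optimum toward smoother surfaces and away from a tight geometric fit.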

Problem 2-6: Interpreting the model

I wanted to see which parts of the input image the model relies on when producing its single-view 3D prediction, to sanity-check behavior and spot weaknesses. I computed Grad-CAM on the model’s image encoder, using the mean occupancy probability of the predicted voxel volume as the target, and overlaid the heatmap on the RGB with OpenCV's Jet colormap. The maps emphasize structural boundaries (junctions between seat and back, armrests, and legs), while large flat regions remain cooler. Overall, the model leans on edge/structure cues to form its 3D prediction.
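The Grad-CAM step can be sketched as follows; target_layer is assumed to be the last convolutional layer of the image encoder, and model(image) is assumed to return voxel-occupancy logits, so the actual interfaces in my code may differ slightly.

    import torch
    import torch.nn.functional as F

    def grad_cam(model, image, target_layer):
        """Grad-CAM heatmap w.r.t. the mean predicted occupancy probability."""
        feats, grads = {}, {}
        h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
        h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

        logits = model(image)                    # assumed: voxel-occupancy logits
        model.zero_grad()
        torch.sigmoid(logits).mean().backward()  # mean occupancy as the target
        h1.remove(); h2.remove()

        w = grads["a"].mean(dim=(2, 3), keepdim=True)            # GAP the gradients per channel
        cam = F.relu((w * feats["a"]).sum(dim=1, keepdim=True))  # weighted feature sum
        cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
        return (cam / (cam.max() + 1e-8)).squeeze()              # normalized to [0, 1]

The normalized map can then be colorized with cv2.applyColorMap(np.uint8(255 * cam), cv2.COLORMAP_JET) and alpha-blended onto the input image.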

Separately, as a side script, I made a cool-looking animation of the voxels coming into shape as the loss goes down in fit_data.py --type 'vox'. Note that this is independent of the Grad-CAM analysis above.

Problem 3-3: Extended dataset for training

Fig. 3-3(1-a): Input RGB
Fig. 3-3(1-b): Predicted
Fig. 3-3(1-c): Ground Truth
Fig. 3-3(2-a): Input RGB
Fig. 3-3(2-b): Predicted
Fig. 3-3(2-c): Ground Truth
Fig. 3-3(3-a): Input RGB
Fig. 3-3(3-b): Predicted
Fig. 3-3(3-c): Ground Truth
Fig. 3-3: F1 score vs Threshold

Training on three classes produced F1 results similar to single-class training, but required substantially more compute and longer training time. Qualitatively, the reconstructions looked comparable. In short, multi-class training may add robustness, but single-class training is much more efficient; the choice comes down to how much cross-class robustness is needed versus the compute available for training and inference.