16-825: Learning for 3D Vision — Assignment 2

Manyung Emma Hon · mehon · Fall 2025

1.1. Fitting a voxel grid (5 points)

Figure: source (left) and target (right) voxel grids.
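For reference, a minimal sketch of the voxel-fitting optimization, assuming the standard PyTorch setup; variable names such as `voxels_tgt` are illustrative, not the exact starter-code names. The idea is to optimize a raw logit grid directly against the target occupancy with a binary cross-entropy loss.

```python
import torch

def fit_voxel_grid(voxels_tgt, n_iters=1000, lr=1e-1):
    # voxels_tgt: (1, 32, 32, 32) tensor with occupancy values in [0, 1]
    voxels_src = torch.randn_like(voxels_tgt, requires_grad=True)  # raw logits
    optimizer = torch.optim.Adam([voxels_src], lr=lr)
    bce = torch.nn.BCEWithLogitsLoss()
    for _ in range(n_iters):
        optimizer.zero_grad()
        loss = bce(voxels_src, voxels_tgt)   # per-voxel occupancy loss
        loss.backward()
        optimizer.step()
    return torch.sigmoid(voxels_src)         # predicted occupancy in [0, 1]
```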

1.2. Fitting a point cloud (5 points)

Figure: source (left) and target (right) point clouds.
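A minimal sketch of the point-cloud fitting, assuming PyTorch3D's `chamfer_distance` as the loss (the assignment's own chamfer implementation could be substituted); names are placeholders.

```python
import torch
from pytorch3d.loss import chamfer_distance

def fit_point_cloud(points_tgt, n_points=5000, n_iters=1000, lr=1e-2):
    # points_tgt: (1, P, 3) target point cloud
    points_src = torch.randn(1, n_points, 3, requires_grad=True)
    optimizer = torch.optim.Adam([points_src], lr=lr)
    for _ in range(n_iters):
        optimizer.zero_grad()
        loss, _ = chamfer_distance(points_src, points_tgt)  # symmetric chamfer
        loss.backward()
        optimizer.step()
    return points_src.detach()
```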

1.3. Fitting a mesh (5 points)

Figure: source (left) and target (right) meshes.
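A minimal sketch of the mesh fitting: deform an ico-sphere's vertices with a chamfer loss on sampled surface points plus a Laplacian smoothness term. The smoothness weight `w_smooth` and sphere level are assumed values, not necessarily the ones used in my runs.

```python
import torch
from pytorch3d.utils import ico_sphere
from pytorch3d.ops import sample_points_from_meshes
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing

def fit_mesh(mesh_tgt, n_iters=1000, lr=1e-2, w_smooth=0.1):
    mesh_src = ico_sphere(level=4)
    deform = torch.zeros_like(mesh_src.verts_packed(), requires_grad=True)
    optimizer = torch.optim.Adam([deform], lr=lr)
    points_tgt = sample_points_from_meshes(mesh_tgt, num_samples=5000)
    for _ in range(n_iters):
        optimizer.zero_grad()
        new_mesh = mesh_src.offset_verts(deform)
        points_src = sample_points_from_meshes(new_mesh, num_samples=5000)
        loss_chamfer, _ = chamfer_distance(points_src, points_tgt)
        loss_smooth = mesh_laplacian_smoothing(new_mesh)  # keeps the surface smooth
        loss = loss_chamfer + w_smooth * loss_smooth
        loss.backward()
        optimizer.step()
    return mesh_src.offset_verts(deform.detach())
```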

2.1. Image to voxel grid (20 points)

Figures: three examples, each shown in order of input image, ground truth, and prediction.
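A hedged sketch of the single-view-to-voxel model: the layer sizes and the choice of a ResNet-18 encoder are illustrative, not necessarily the exact architecture in my submission. An image encoder produces a feature vector, and a fully connected decoder maps it to one occupancy logit per voxel, trained with `BCEWithLogitsLoss` exactly as in the single-grid fitting above.

```python
import torch
import torch.nn as nn
import torchvision

class ImageToVoxel(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet18(weights="DEFAULT")
        self.encoder = nn.Sequential(*list(resnet.children())[:-1])  # (B, 512, 1, 1)
        self.decoder = nn.Sequential(
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, 32 * 32 * 32),           # one logit per voxel
        )

    def forward(self, images):
        feats = self.encoder(images).flatten(1)       # (B, 512) image features
        logits = self.decoder(feats)
        return logits.view(-1, 32, 32, 32)            # occupancy logits
```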

2.2. Image to point cloud (20 points)

Figures: three examples, each shown in order of input image, ground truth, and prediction.
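A hedged sketch of the point-cloud decoder head, assuming the same image encoder as above; `n_points` and the MLP widths are illustrative. The output is reshaped into XYZ coordinates and supervised with the chamfer distance against points sampled from the ground-truth mesh.

```python
import torch.nn as nn

class PointCloudDecoder(nn.Module):
    def __init__(self, feat_dim=512, n_points=5000):
        super().__init__()
        self.n_points = n_points
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 2048), nn.ReLU(),
            nn.Linear(2048, n_points * 3),            # n_points XYZ coordinates
        )

    def forward(self, feats):
        # feats: (B, feat_dim) image features from the encoder
        return self.mlp(feats).view(-1, self.n_points, 3)
```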

2.3. Image to mesh (20 points)

Figures: three examples, each shown in order of input image, ground truth, and prediction.
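A hedged sketch of the mesh decoder: regress per-vertex offsets of a fixed ico-sphere template from image features (the widths and ico-sphere level are assumptions). The loss is the chamfer distance on sampled surface points plus the Laplacian smoothness term, as in the mesh-fitting sketch above.

```python
import torch.nn as nn
from pytorch3d.utils import ico_sphere

class MeshDecoder(nn.Module):
    def __init__(self, feat_dim=512, ico_level=4):
        super().__init__()
        self.template = ico_sphere(level=ico_level)            # fixed topology
        self.n_verts = self.template.verts_packed().shape[0]
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, self.n_verts * 3),                 # per-vertex offsets
        )

    def forward(self, feats):
        # feats: (B, feat_dim) image features from the shared encoder
        B = feats.shape[0]
        offsets = self.mlp(feats).reshape(B * self.n_verts, 3)
        meshes = self.template.extend(B).to(feats.device)
        return meshes.offset_verts(offsets)
```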

2.4. Quantitative comparisons (10 points)

Figures: F1-score evaluation plots for the voxel grid, point cloud, and mesh models.
Looking at the F1-score results, point clouds win with 78.231% at the 0.05 threshold, meshes reach 72.616%, and voxels lag behind at 70.472%. Some possible reasons: point clouds do best because the network predicts exact XYZ coordinates as floating-point numbers, so it can place points anywhere in 3D space with high precision. Voxels perform worst because they are stuck on a grid; the flaw is especially obvious in the second example, where there is not much continuous surface to reconstruct. Our voxel grid is only 32x32x32, so curved surfaces come out blocky and stair-stepped, and there is simply no way to represent details smaller than one voxel, no matter how good the model is.

Meshes land in the middle: they have smooth surfaces like point clouds, but they are limited by the ico-sphere initialization, so the model cannot easily add holes or change the topology. The smoothness loss we added also helps produce nice-looking meshes, but it smooths away potential fine details. The gap between the representations is biggest at tight thresholds like 0.02, where point clouds are about 3% better; at looser thresholds like 0.05 they all move closer together, since we are only checking whether the rough shape is right rather than the exact details.
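For context, a hedged sketch of the F1@threshold metric behind these numbers: precision is the fraction of predicted points within `thresh` of some ground-truth point, recall is the converse, and F1 is their harmonic mean. The course's evaluation code may scale or report the values slightly differently (e.g. as percentages); this is the standard formulation.

```python
import torch
from pytorch3d.ops import knn_points

def f1_score(points_pred, points_gt, thresh=0.05):
    # points_pred, points_gt: (1, P, 3) point sets sampled from each shape
    dist_pred_to_gt = knn_points(points_pred, points_gt, K=1).dists.sqrt()
    dist_gt_to_pred = knn_points(points_gt, points_pred, K=1).dists.sqrt()
    precision = (dist_pred_to_gt < thresh).float().mean()
    recall = (dist_gt_to_pred < thresh).float().mean()
    return 2 * precision * recall / (precision + recall + 1e-8)
```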

2.5. Analyze the effects of hyperparameter variations (10 points)

I analyzed the effect of varying n_points (the number of predicted points) on point-cloud reconstruction quality and efficiency. Testing with n_points = 1000, 3000, and 5000, I found that the F1@0.05 score improved from 78.2% to 81.6% to 85.7% as the point count increased. Higher point counts consistently improved accuracy, but at a cost: going from 1000 to 3000 points gave a 3.4% improvement, and going from 3000 to 5000 added another 4.1%, with longer training times and higher memory usage. Qualitatively, points cluster more densely in the central regions (the chair seat and backrest) than at extremities such as the legs and armrests, which appear about 3x sparser. This non-uniform distribution suggests the model prioritizes placing points in high-confidence regions rather than ensuring even coverage of the entire surface, which explains why thin structures remain challenging even at higher point counts. From a practical standpoint, n_points = 3000 offers a good balance between accuracy (81.6% F1) and computational efficiency, while n_points = 5000 gives the best quality when maximum precision is required, despite the additional computational cost.
Figures: predictions with n_points = 1000, 3000, and 5000.
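A hedged sketch of how the sweep was set up: the only per-run change is the size of the decoder's output layer (and the number of ground-truth points sampled for the chamfer loss). This builds on the `PointCloudDecoder` sketch from section 2.2; the training and evaluation loops are otherwise unchanged between runs.

```python
import torch

for n_points in (1000, 3000, 5000):
    model = PointCloudDecoder(feat_dim=512, n_points=n_points)
    n_params = sum(p.numel() for p in model.parameters())
    # sanity check: output shape scales with n_points, and the parameter
    # count grows only in the final linear layer
    out = model(torch.zeros(2, 512))
    print(n_points, tuple(out.shape), n_params)   # e.g. 1000 -> (2, 1000, 3)
```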

2.6. Interpret your model (15 points)

Figure: per-point error visualization (two prediction views, ground truth, error histogram).
This is the first couch from another angle. The visualization color-codes predicted points by their distance to the nearest ground-truth point: blue means accurate (low error), red means inaccurate (high error). The first two panels show the 3D prediction from different angles with error coloring, revealing spatial patterns in where the model struggles. The third panel shows the ground truth for reference, and the fourth panel is a histogram of the error distribution across all points: most errors are small, but a few outliers show where the model completely missed parts of the geometry.
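A hedged sketch of how such a visualization can be produced: color each predicted point by its distance to the nearest ground-truth point, then histogram those distances. The exact plotting code in my submission differs in layout, but this is the idea.

```python
import matplotlib.pyplot as plt
from pytorch3d.ops import knn_points

def visualize_errors(points_pred, points_gt, out_path="error_vis.png"):
    # points_pred, points_gt: (1, P, 3) point sets
    dists = knn_points(points_pred, points_gt, K=1).dists.sqrt()[0, :, 0]
    pts = points_pred[0].detach().cpu().numpy()
    err = dists.detach().cpu().numpy()

    fig = plt.figure(figsize=(10, 4))
    ax = fig.add_subplot(1, 2, 1, projection="3d")
    ax.scatter(pts[:, 0], pts[:, 1], pts[:, 2], c=err, cmap="coolwarm", s=1)
    ax.set_title("per-point error (blue = low, red = high)")

    ax2 = fig.add_subplot(1, 2, 2)
    ax2.hist(err, bins=50)                 # distribution of per-point errors
    ax2.set_title("error distribution")
    plt.savefig(out_path)
```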

3.3. Extended dataset for training (10 points)

Figures: input image, ground truth, chair-only model prediction, full-dataset model prediction.
I use the same image from 2.2 to compare the chair-only model (third image) with the model trained on the full dataset (fourth image). The F1@0.05 score dropped from 78.231 to 76.860. I suspect this is model-capacity dilution: the same network size must now learn three different object classes and is less focused on chair-specific features. There may also be feature conflicts between the three classes and a loss of specialization. However, generalization improves: the overall structure is captured more reliably, even when the details are fuzzy.