16-825 Assignment 2: Single View to 3D
1. Exploring loss functions
1.1. Fitting a voxel grid (5 points): optimized voxel grid and ground truth

1.2. Fitting a point cloud (5 points): prediction and ground truth

1.3. Fitting a mesh (5 points): prediction and ground truth
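The three fitting objectives can be sketched in plain NumPy (a minimal sketch for reference only; the assignment's actual implementation uses PyTorch, and the function names here are illustrative): binary cross-entropy over occupancies for the voxel grid, and a symmetric Chamfer distance for the point cloud, which also serves as the data term when fitting the mesh (alongside a smoothness regularizer, omitted here).

```python
import numpy as np

def voxel_bce(pred, target, eps=1e-7):
    """Binary cross-entropy between predicted occupancy probabilities and 0/1 targets."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(-(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean())

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)  # (N, M) squared distances
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())
```

Minimizing `chamfer(pred_points, gt_points)` with respect to `pred_points` is exactly the point-cloud fitting setup; for the voxel grid, `voxel_bce` is minimized over the predicted occupancy logits.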

2. Reconstructing 3D from single view
*Trained on CPU before I received AWS credits, using `--load_feats`.
2.1. Image to voxel grid (20 points)
| Input RGB | Predicted Voxel Grid | Ground Truth Mesh |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
2.2. Image to point cloud (20 points)
| Input RGB | Predicted Point Cloud | Ground Truth Mesh |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
2.3. Image to mesh (20 points)
| Input RGB | Predicted Mesh | Ground Truth Mesh |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
2.4. Quantitative comparisons (10 points)

Voxel-based reconstruction
- The F1-score for voxels is the lowest among the three methods (max 55 at threshold 0.05).
- This is expected because voxel grids are a coarse discretization of 3D space.
- Fine details get lost due to the grid resolution, leading to lower precision and recall, especially at stricter thresholds.
Point cloud-based reconstruction
- Point clouds achieve the highest F1-scores, reaching close to 70 at threshold 0.05.
- Point-based methods directly represent surfaces with sampled points, avoiding quantization errors from voxels.
- This makes them more accurate for capturing fine details in the object geometry. However, point clouds lack explicit connectivity, which can make them noisier.
Mesh-based reconstruction
- Meshes show mid-to-high performance, peaking around 62.
- They outperform voxels because meshes explicitly represent continuous surfaces.
- However, they slightly underperform point clouds since mesh prediction is more complex (requires both geometry and topology to be correct). Errors in faces or connectivity reduce recall.
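For reference, the F1 curves compared above treat precision as the fraction of predicted points within the distance threshold of some ground-truth point, and recall as the converse. A minimal NumPy sketch (the starter kit's exact evaluation code may differ in details):

```python
import numpy as np

def f1_score(pred, gt, tau=0.05):
    """F1 (in %) at distance threshold tau between point sets pred (N, 3) and gt (M, 3)."""
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # (N, M) pairwise distances
    precision = 100.0 * (d.min(axis=1) < tau).mean()  # predicted points near some GT point
    recall = 100.0 * (d.min(axis=0) < tau).mean()     # GT points near some predicted point
    return 2 * precision * recall / (precision + recall + 1e-8)
```

This also explains the qualitative ordering: voxel outputs must first be converted to a surface point set (e.g. by sampling the marching-cubes mesh), so quantization error directly depresses both precision and recall.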
2.5. Analyze the effects of hyperparameter variations (10 points)
Varied `n_points` for the point clouds to 600, 1000, and 4000. All models were trained for 40,000 iterations on a CPU using `--load_feats`.
| eval n_points=600 | eval n_points=1000 | eval n_points=4000 |
|---|---|---|
| ![]() | ![]() | ![]() |
Outputs from hyperparameter variations
| Input RGB | Ground Truth Mesh | Pred n_points=600 | Pred n_points=1000 | Pred n_points=4000 |
|---|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() |
Varied the number of points (`n_points`) in the point-cloud decoder to 600, 1000, and 4000, keeping all other settings constant and training each model for 40,000 iterations on CPU using precomputed ResNet features. Quantitatively, the 4000-point model achieved the highest F1-scores across all thresholds, followed by 1000 and then 600. The larger output dimensionality of the 4000-point model increased training time but allowed finer reconstruction detail.
Visually, the 4000-point reconstructions captured the most complete and realistic 3D shapes, especially in curved and thin structures, while the 600-point outputs appeared sparse and lost geometric detail. The 1000-point results were more balanced but still coarser. Although the 4000-point outputs were generally superior, some examples exhibited noisy or over-scattered points, indicating that the higher resolution occasionally introduced instability without additional regularization.
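One reason the 4000-point run trains more slowly: if the decoder ends in a linear layer mapping a hidden feature vector to `n_points * 3` coordinates, the parameter count of that head grows linearly with `n_points`. A quick back-of-the-envelope check (the hidden size of 512 is an assumed, illustrative value, not the assignment's actual architecture):

```python
def head_param_count(hidden_dim, n_points):
    """Parameters in a final linear layer mapping hidden_dim features to n_points * 3 coords."""
    out_dim = n_points * 3
    return hidden_dim * out_dim + out_dim  # weight matrix + bias vector

for n in (600, 1000, 4000):
    print(f"n_points={n}: {head_param_count(512, n):,} head parameters")
```

At these settings the 4000-point head is roughly 6.7x larger than the 600-point head, consistent with the observed increase in training time.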
2.6. Interpret your model (15 points)
I measured mesh complexity using the number of faces and vertices in each ground-truth mesh and plotted it against F1@0.05. The scatter plot shows a very weak correlation between complexity and reconstruction accuracy, indicating that the model's performance does not strongly depend on surface detail or polygon count. Visually (in the GIFs below), the most complex and simplest meshes look similarly poor, with both suffering from missing or distorted geometry.
Interestingly, the lowest F1-score example corresponds to an intricate mesh, suggesting that detailed structures remain challenging for the model. Overall, the results suggest that while the decoder captures general shape structure, it struggles almost equally across simple and complex meshes, showing limited sensitivity to geometric complexity.
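The weak-correlation claim can be quantified with a Pearson coefficient between per-example face counts and F1@0.05. A sketch (`face_counts` and `f1_scores` stand in for the per-example values gathered during evaluation):

```python
import numpy as np

def complexity_f1_corr(face_counts, f1_scores):
    """Pearson correlation between GT mesh face counts and per-example F1@0.05 scores."""
    return float(np.corrcoef(face_counts, f1_scores)[0, 1])
```

A value near zero here matches what the scatter plot shows: polygon count is a poor predictor of reconstruction quality for this model.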

Most complex mesh ground truth and prediction

Simplest mesh ground truth and prediction

Highest F1 score mesh ground truth and prediction

Lowest F1 score mesh ground truth and prediction

3. Exploring other architectures / datasets
3.3 Extended dataset for training (10 points)
I trained on the larger dataset for point cloud reconstruction using the same model and hyperparameters, until the model reached the same loss of 0.03.
Training on the larger dataset resulted in a very slightly lower F1 curve at each threshold (nearly identical at some points). This is likely because the training set for this question included two other classes with different structures: the training distribution is no longer just chairs, so the predicted point clouds also look less condensed (more noisy and scattered) and less visually accurate.
| Input RGB | Ground Truth Mesh | Predicted point cloud trained on small dataset | Predicted point cloud trained on full dataset |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() |
F1 curves for small dataset and large dataset: