Assignment 2 – Single View to 3D
| Course | 16-825 (Learning for 3D Vision) |
|---|---|
| Assignment | 2 |
| Student | Kunwoo Lee (kunwool@andrew.cmu.edu) |
1. Exploring Loss Functions
1.1 Voxel BCE (5 pt)
Implement: voxel_loss in losses.py. Demo: run python fit_data.py --type 'vox'.
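A minimal sketch of `voxel_loss`, assuming the decoder outputs raw (pre-sigmoid) logits over the voxel grid and the targets are binary occupancies; the tensor shapes in the comments are my assumptions:

```python
import torch.nn.functional as F

def voxel_loss(voxel_src, voxel_tgt):
    # voxel_src: (B, D, H, W) raw logits from the decoder (assumption)
    # voxel_tgt: (B, D, H, W) binary ground-truth occupancies in {0, 1}
    # binary_cross_entropy_with_logits folds the sigmoid into the loss
    # for numerical stability.
    return F.binary_cross_entropy_with_logits(voxel_src, voxel_tgt.float())
```

If the decoder already applies a sigmoid, `F.binary_cross_entropy` on the probabilities is the equivalent choice.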
1.2 Point Cloud Chamfer (5 pt)
Implement: chamfer_loss in losses.py (no high-level PyTorch3D loss; knn_points/knn_gather allowed). Demo: python fit_data.py --type 'point'.
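A minimal sketch of `chamfer_loss` built on the allowed `knn_points` primitive; averaging (rather than summing) over points in each direction is my convention, not prescribed by the handout:

```python
from pytorch3d.ops import knn_points

def chamfer_loss(point_cloud_src, point_cloud_tgt):
    # point_cloud_src: (B, N_src, 3); point_cloud_tgt: (B, N_tgt, 3)
    # knn_points with K=1 returns the squared distance from each point
    # to its nearest neighbour in the other cloud.
    dists_src = knn_points(point_cloud_src, point_cloud_tgt, K=1).dists  # (B, N_src, 1)
    dists_tgt = knn_points(point_cloud_tgt, point_cloud_src, K=1).dists  # (B, N_tgt, 1)
    # Symmetric Chamfer distance: mean of both one-sided terms.
    return dists_src.mean() + dists_tgt.mean()
```

Note that `.dists` holds squared Euclidean distances, which is the standard Chamfer formulation; taking a square root first would give an L2 variant instead.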
1.3 Mesh Smoothness (5 pt)
Implement: smoothness_loss in losses.py. Demo: python fit_data.py --type 'mesh'.
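A minimal sketch of `smoothness_loss` using PyTorch3D's uniform Laplacian regularizer, which pulls each vertex toward the centroid of its neighbors:

```python
from pytorch3d.loss import mesh_laplacian_smoothing

def smoothness_loss(mesh_src):
    # Penalizes the uniform Laplacian of the vertex positions,
    # discouraging spiky, irregular surfaces.
    return mesh_laplacian_smoothing(mesh_src, method="uniform")
```

If the handout requires implementing the regularizer from scratch, the same quantity can be assembled from the sparse Laplacian returned by `mesh_src.laplacian_packed()` applied to `mesh_src.verts_packed()`.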
2. Reconstructing 3D from Single View
Train decoders for voxels, points, and meshes via train_model.py. Provide qualitative examples and quantitative evaluations per instructions.
2.1 Image → Voxel Grid (20 pt)
Implement: voxel decoder in model.py. Train: python train_model.py --type 'vox'. Eval: python eval_model.py --type 'vox' --load_checkpoint.
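One plausible decoder shape, shown below as a sketch; the layer widths, class name, and 512-d feature size are my choices, not prescribed by the starter code. It projects the image feature to a small 3D grid and upsamples to 32³ with transposed 3D convolutions:

```python
import torch.nn as nn

class VoxelDecoder(nn.Module):
    # Hypothetical decoder: 512-d image feature -> (B, 32, 32, 32) logits.
    def __init__(self, feat_dim=512):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 64 * 4 * 4 * 4)
        self.deconv = nn.Sequential(
            nn.ConvTranspose3d(64, 32, kernel_size=4, stride=2, padding=1),  # 4 -> 8
            nn.ReLU(),
            nn.ConvTranspose3d(32, 8, kernel_size=4, stride=2, padding=1),   # 8 -> 16
            nn.ReLU(),
            nn.ConvTranspose3d(8, 1, kernel_size=4, stride=2, padding=1),    # 16 -> 32
        )

    def forward(self, feat):
        x = self.fc(feat).view(-1, 64, 4, 4, 4)
        return self.deconv(x).squeeze(1)  # raw logits; pair with BCE-with-logits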
2.2 Image → Point Cloud (20 pt)
Implement: point-cloud decoder in model.py. Train: python train_model.py --type 'point'. Eval: python eval_model.py --type 'point' --load_checkpoint.
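A sketch of a simple point-cloud decoder, assuming a 512-d image feature and a plain MLP head; the hidden sizes are my choices:

```python
import torch
import torch.nn as nn

class PointDecoder(nn.Module):
    # Hypothetical decoder: 512-d image feature -> (B, n_points, 3).
    def __init__(self, feat_dim=512, n_points=1000):
        super().__init__()
        self.n_points = n_points
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 2048), nn.ReLU(),
            nn.Linear(2048, n_points * 3),
        )

    def forward(self, feat):
        # tanh keeps predictions inside the normalized (-1, 1)^3 volume
        return torch.tanh(self.mlp(feat)).view(-1, self.n_points, 3)
```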
2.3 Image → Mesh (20 pt)
Implement: mesh decoder in model.py (try different initializations beyond ico_sphere). Train: python train_model.py --type 'mesh'. Eval: python eval_model.py --type 'mesh' --load_checkpoint.
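A sketch of a deformation-based mesh decoder, assuming the common setup of predicting per-vertex offsets on a fixed template; the architecture details are my assumptions:

```python
import torch.nn as nn
from pytorch3d.utils import ico_sphere

class MeshDecoder(nn.Module):
    # Hypothetical decoder: predicts per-vertex offsets that deform a
    # template mesh (ico_sphere here; other templates can be swapped in).
    def __init__(self, feat_dim=512, level=4, device=None):
        super().__init__()
        self.template = ico_sphere(level, device)
        n_verts = self.template.verts_packed().shape[0]
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, n_verts * 3),
        )

    def forward(self, feat):
        b = feat.shape[0]
        offsets = self.mlp(feat).view(b, -1, 3)       # (B, V, 3) per-vertex offsets
        mesh = self.template.extend(b)                # replicate template per sample
        return mesh.offset_verts(offsets.reshape(-1, 3))
```

Trying a different initialization only requires replacing `self.template`, e.g. with a higher-subdivision sphere or a coarse hand-built primitive, while the offset head stays the same.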
2.4 Quantitative Comparisons (10 pt)
Plot F1-score vs threshold for voxel, point, and mesh networks (eval_{type}.png). Include short discussion.
All three models show a monotonic increase in F1-score as the distance threshold increases, which is expected since looser thresholds count more near-matches as correct. The voxel-based model performs reasonably but is limited by its coarse spatial resolution. The point-cloud network achieves the highest overall accuracy, likely due to its direct supervision on surface points. The mesh network follows a similar trend but slightly lags behind, reflecting the added difficulty of maintaining geometric regularity while fitting full surface connectivity.
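For reference, a minimal sketch of how the overlay plot can be produced, assuming each eval run dumped its thresholds and F1 scores to an `.npz` file (the file names and keys here are hypothetical, not part of the starter code):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical per-run dumps: np.savez(f"eval_{name}.npz", thresholds=..., f1=...)
for name in ["vox", "point", "mesh"]:
    data = np.load(f"eval_{name}.npz")
    plt.plot(data["thresholds"], data["f1"], label=name)
plt.xlabel("distance threshold")
plt.ylabel("F1 score")
plt.legend()
plt.savefig("eval_compare.png")
```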
2.5 Hyperparameter Study (10 pt)
Vary one hyperparameter (e.g., n_points, vox_size, w_chamfer, mesh initialization) and analyze its effect. Add plots/tables and a brief conclusion.
The plots compare model performance when varying the number of predicted points (n_points = 1000 vs. n_points = 2000). Increasing the number of predicted points from 1000 to 2000 leads to a modest but consistent improvement in F1 across all thresholds. This suggests that denser point sampling helps the network capture more surface detail and reduces sparsity in the reconstruction, particularly for fine geometric parts (e.g., chair legs or backrests).
2.6 Model Interpretation (15 pt)
Create visualizations that explain what the model learns (e.g., saliency on image features, attention maps, error heatmaps on 3D). Add short insights.
To better understand what each model learned, I visualized per-point reconstruction error for the point-cloud model, color-coded from blue (low error) to red (high error). The model's predicted point set was compared against the ground-truth mesh using nearest-neighbor distances. The resulting error distributions reveal characteristic differences between representations. The point-cloud decoder reconstructs global geometry reasonably well, but red regions appear on the backrest and the underside of the seat, areas that are heavily occluded in the input view. This suggests the network relies strongly on visible surface cues and under-represents occluded structures.
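A sketch of how these error colors can be computed, assuming the ground truth is a PyTorch3D `Meshes` object sampled into points; the function name, sample count, and colormap choice are mine:

```python
import torch
import matplotlib.cm as cm
from pytorch3d.ops import knn_points, sample_points_from_meshes

def error_colors(pred_points, gt_mesh, n_samples=10000):
    # pred_points: (1, N, 3) predicted point cloud; gt_mesh: Meshes object.
    gt_points = sample_points_from_meshes(gt_mesh, n_samples)      # (1, S, 3)
    dists = knn_points(pred_points, gt_points, K=1).dists[..., 0]  # (1, N) squared
    err = dists.sqrt().squeeze(0)            # Euclidean NN distance per point
    err = err / err.max()                    # normalize to [0, 1]
    # Map low error -> blue, high error -> red with a diverging colormap.
    rgb = cm.coolwarm(err.cpu().numpy())[:, :3]  # (N, 3)
    return torch.tensor(rgb)
```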
3. Exploring Other Architectures / Datasets
Choose at least one subsection. Completing multiple earns extra credit.
3.1 Implicit Network (10 pt)
Implement an implicit decoder that maps (image features, 3D location) → occupancy. Predict over a 32×32×32 grid in (−1, 1)³ and visualize the resulting mesh via marching cubes.
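A sketch of one such implicit decoder, conditioning a point-wise MLP on the image feature; the hidden width and class name are my assumptions:

```python
import torch
import torch.nn as nn

class ImplicitDecoder(nn.Module):
    # Hypothetical occupancy decoder: (image feature, xyz) -> occupancy logit.
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feat, points):
        # feat: (B, feat_dim); points: (B, P, 3) query locations in (-1, 1)^3
        f = feat.unsqueeze(1).expand(-1, points.shape[1], -1)
        return self.mlp(torch.cat([f, points], dim=-1)).squeeze(-1)  # (B, P) logits
```

At test time, one can query the decoder on a 32×32×32 grid of `torch.linspace(-1, 1, 32)` coordinates, apply a sigmoid, threshold at 0.5, and run marching cubes (e.g., `skimage.measure.marching_cubes`) on the resulting occupancy volume to extract the mesh.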