Assignment 2: Single View to 3D

Course: 16-825 (Learning for 3D Vision)
Assignment: 2
Student: Kunwoo Lee (kunwool@andrew.cmu.edu)

1. Exploring Loss Functions

1.1 Voxel BCE (5 pt)

Implement: voxel_loss in losses.py. Demo: python fit_data.py --type 'vox'.
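For reference, a minimal sketch of what voxel_loss could look like, assuming voxel_src holds raw occupancy logits and voxel_tgt binary ground-truth occupancies (the shapes and dtype handling are my assumptions, not the handout's exact spec):

```python
import torch
import torch.nn.functional as F

def voxel_loss(voxel_src: torch.Tensor, voxel_tgt: torch.Tensor) -> torch.Tensor:
    # Binary cross-entropy between predicted occupancy logits and binary targets.
    # The logits variant folds in the sigmoid and is numerically stabler.
    return F.binary_cross_entropy_with_logits(voxel_src, voxel_tgt.float())
```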

GT voxel grid (placeholder)
Ground-truth voxel grid.
Fitted voxel grid (placeholder)
Optimized voxel grid over iterations.

1.2 Point Cloud Chamfer (5 pt)

Implement: chamfer_loss in losses.py (no high-level PyTorch3D loss; knn_points/knn_gather allowed). Demo: python fit_data.py --type 'point'.
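A minimal sketch using knn_points, assuming (B, N, 3) source and (B, M, 3) target point clouds (argument names are assumptions; note that knn_points returns squared distances in its .dists field):

```python
import torch
from pytorch3d.ops import knn_points

def chamfer_loss(point_cloud_src: torch.Tensor, point_cloud_tgt: torch.Tensor) -> torch.Tensor:
    # Squared distance from each source point to its nearest target point, and vice versa.
    d_src = knn_points(point_cloud_src, point_cloud_tgt, K=1).dists  # (B, N, 1)
    d_tgt = knn_points(point_cloud_tgt, point_cloud_src, K=1).dists  # (B, M, 1)
    # Symmetric Chamfer distance: mean over both directions.
    return d_src.mean() + d_tgt.mean()
```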

GT point cloud (placeholder)
Ground-truth point cloud.
Fitted point cloud (placeholder)
Optimized point cloud over iterations.

1.3 Mesh Smoothness (5 pt)

Implement: smoothness_loss in losses.py. Demo: python fit_data.py --type 'mesh'.
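One common choice is Laplacian smoothing. Assuming the handout permits PyTorch3D's built-in here (unlike the Chamfer case, no restriction is stated), the sketch is one line:

```python
from pytorch3d.loss import mesh_laplacian_smoothing

def smoothness_loss(mesh_src):
    # Uniform Laplacian smoothing: penalizes vertices that deviate from the
    # mean of their 1-ring neighbors, discouraging spiky geometry.
    return mesh_laplacian_smoothing(mesh_src, method="uniform")
```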

GT mesh (placeholder)
Ground-truth mesh.
Fitted mesh (placeholder)
Optimized mesh over iterations.

2. Reconstructing 3D from Single View

Train decoders for voxels, points, and meshes via train_model.py. Provide qualitative examples and quantitative evaluations per instructions.

2.1 Image β†’ Voxel Grid (20 pt)

Implement: voxel decoder in model.py. Train: python train_model.py --type 'vox'. Eval: python eval_model.py --type 'vox' --load_checkpoint.
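One possible decoder shape, assuming the encoder produces a 512-d feature vector and the output is a 32^3 grid of logits (layer sizes and names are illustrative, not the graded architecture):

```python
import torch.nn as nn

class VoxelDecoder(nn.Module):
    # Hypothetical decoder: 512-d image feature -> 32^3 occupancy logits.
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 128 * 4 * 4 * 4)
        self.net = nn.Sequential(
            nn.ConvTranspose3d(128, 64, kernel_size=4, stride=2, padding=1),  # 4 -> 8
            nn.ReLU(),
            nn.ConvTranspose3d(64, 32, kernel_size=4, stride=2, padding=1),   # 8 -> 16
            nn.ReLU(),
            nn.ConvTranspose3d(32, 1, kernel_size=4, stride=2, padding=1),    # 16 -> 32
        )

    def forward(self, feat):
        x = self.fc(feat).view(-1, 128, 4, 4, 4)
        return self.net(x).squeeze(1)  # (B, 32, 32, 32) logits, paired with voxel_loss
```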

Input RGB (placeholder)
Input RGB.
Pred voxel render (placeholder)
Predicted voxel render.
GT mesh render (placeholder)
Ground-truth mesh render.
Input RGB (placeholder)
Input RGB.
Pred voxel render (placeholder)
Predicted voxel render.
GT mesh render (placeholder)
Ground-truth mesh render.
Input RGB (placeholder)
Input RGB.
Pred voxel render (placeholder)
Predicted voxel render.
GT mesh render (placeholder)
Ground-truth mesh render.

2.2 Image β†’ Point Cloud (20 pt)

Implement: point-cloud decoder in model.py. Train: python train_model.py --type 'point'. Eval: python eval_model.py --type 'point' --load_checkpoint.
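A simple fully connected decoder works as a baseline, assuming a 512-d feature and n_points outputs in the normalized (-1, 1) cube (all sizes here are assumptions):

```python
import torch.nn as nn

class PointDecoder(nn.Module):
    # Hypothetical decoder: 512-d image feature -> (n_points, 3) coordinates.
    def __init__(self, feat_dim: int = 512, n_points: int = 1000):
        super().__init__()
        self.n_points = n_points
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 2048), nn.ReLU(),
            nn.Linear(2048, n_points * 3), nn.Tanh(),  # keep points in (-1, 1)
        )

    def forward(self, feat):
        return self.net(feat).view(-1, self.n_points, 3)
```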

Input RGB (placeholder)
Input RGB.
Pred point cloud (placeholder)
Predicted point cloud.
GT mesh render (placeholder)
Ground-truth mesh render.
Input RGB (placeholder)
Input RGB.
Pred point cloud (placeholder)
Predicted point cloud.
GT mesh render (placeholder)
Ground-truth mesh render.
Input RGB (placeholder)
Input RGB.
Pred point cloud (placeholder)
Predicted point cloud.
GT mesh render (placeholder)
Ground-truth mesh render.

2.3 Image β†’ Mesh (20 pt)

Implement: mesh decoder in model.py (try different initializations beyond ico_sphere). Train: python train_model.py --type 'mesh'. Eval: python eval_model.py --type 'mesh' --load_checkpoint.
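A common baseline predicts per-vertex offsets on a template; a sketch assuming a 512-d feature and an ico_sphere template (swapping in another initialization changes only the constructor):

```python
import torch.nn as nn
from pytorch3d.utils import ico_sphere

class MeshDecoder(nn.Module):
    # Hypothetical decoder: per-vertex offsets deform a template ico_sphere.
    def __init__(self, feat_dim: int = 512, level: int = 4, device: str = "cuda"):
        super().__init__()
        self.template = ico_sphere(level, device)
        n_verts = self.template.verts_packed().shape[0]
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, n_verts * 3), nn.Tanh(),  # bounded offsets
        )

    def forward(self, feat):
        batch = self.template.extend(feat.shape[0])  # one template per image
        offsets = self.net(feat).reshape(-1, 3)      # packed (B * n_verts, 3)
        return batch.offset_verts(offsets)
```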

Input RGB (placeholder)
Input RGB.
Pred mesh render (placeholder)
Predicted mesh render.
GT mesh render (placeholder)
Ground-truth mesh render.
Input RGB (placeholder)
Input RGB.
Pred mesh render (placeholder)
Predicted mesh render.
GT mesh render (placeholder)
Ground-truth mesh render.
Input RGB (placeholder)
Input RGB.
Pred mesh render (placeholder)
Predicted mesh render.
GT mesh render (placeholder)
Ground-truth mesh render.

2.4 Quantitative Comparisons (10 pt)

Plot F1-score vs. threshold for the voxel, point-cloud, and mesh networks (saved as eval_{type}.png). Include a short discussion.
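For context, the F1 at a given threshold is computed from nearest-neighbor distances between sampled surface points; a sketch of that computation (the exact thresholding convention is an assumption and may differ from the handout's eval code):

```python
import torch
from pytorch3d.ops import knn_points

def f1_at_threshold(points_pred, points_gt, threshold: float):
    # Precision: fraction of predicted points within `threshold` of the GT;
    # recall: fraction of GT points within `threshold` of the prediction.
    d_pred = knn_points(points_pred, points_gt, K=1).dists.sqrt()
    d_gt = knn_points(points_gt, points_pred, K=1).dists.sqrt()
    precision = (d_pred < threshold).float().mean()
    recall = (d_gt < threshold).float().mean()
    return 2 * precision * recall / (precision + recall + 1e-8)
```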

F1 curve voxel
F1 vs threshold (voxel).
F1 curve point
F1 vs threshold (point cloud).
F1 curve mesh
F1 vs threshold (mesh).

All three models show a monotonic increase in F1-score as the distance threshold increases, which is expected since looser thresholds count more near-matches as correct. The voxel-based model performs reasonably but is limited by its coarse spatial resolution. The point-cloud network achieves the highest overall accuracy, likely due to its direct supervision on surface points. The mesh network follows a similar trend but slightly lags behind, reflecting the added difficulty of maintaining geometric regularity while fitting full surface connectivity.

2.5 Hyperparameter Study (10 pt)

Vary one hyperparameter (e.g., n_points, vox_size, w_chamfer, mesh initialization) and analyze its effect. Add plots/tables and a brief conclusion.

F1 curve point
F1 vs threshold (point cloud, n_points = 1000).
F1 curve point
F1 vs threshold (point cloud, n_points = 2000).

The plots compare model performance when varying the number of predicted points (n_points = 1000 vs. n_points = 2000). Increasing the number of predicted points from 1000 to 2000 leads to a modest but consistent improvement in F1 across all thresholds. This suggests that denser point sampling helps the network capture more surface detail and reduces sparsity in the reconstruction, particularly for fine geometric parts (e.g., chair legs or backrests).

2.6 Model Interpretation (15 pt)

Create visualizations that explain what the model learns (e.g., saliency on image features, attention maps, error heatmaps on the 3D output). Add short insights.

Error heatmap (placeholder)
Error heatmap on the predicted point cloud (example 1).
Error heatmap (placeholder)
Error heatmap on the predicted point cloud (example 2).
Error heatmap (placeholder)
Error heatmap on the predicted point cloud (example 3).

To better understand what the model learned, I visualized per-point reconstruction error (color-coded from blue = low error → red = high error) for the point-cloud model. The predicted point set was compared against the ground-truth mesh using nearest-neighbor distances. The resulting error distribution reveals a characteristic failure mode: the point-cloud decoder reconstructs global geometry reasonably well, but red regions appear on the backrest and seat underside, areas with heavy occlusion in the input view. This suggests the network relies strongly on visible surface cues and under-represents occluded structures.
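A minimal sketch of the heatmap computation described above (the color ramp and normalization are my own choices, not a fixed spec):

```python
import torch
from pytorch3d.ops import knn_points

def error_colormap(points_pred: torch.Tensor, points_gt: torch.Tensor) -> torch.Tensor:
    # Per-point distance to the nearest GT surface sample, normalized to [0, 1].
    d = knn_points(points_pred, points_gt, K=1).dists.sqrt().squeeze(-1)  # (B, N)
    t = (d - d.min()) / (d.max() - d.min() + 1e-8)
    # Linear blue (low error) -> red (high error) RGB ramp.
    return torch.stack([t, torch.zeros_like(t), 1.0 - t], dim=-1)  # (B, N, 3)
```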

3. Exploring Other Architectures / Datasets

Choose at least one subsection. Completing multiple earns extra credit.

3.1 Implicit Network (10 pt)

Implement an implicit decoder that maps (image features, 3D location) → occupancy. Predict occupancies over a 32×32×32 grid in (−1, 1)³ and visualize the resulting mesh via marching cubes.
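A sketch of the occupancy decoder and the grid query, assuming a 512-d image feature (layer sizes are illustrative; mesh extraction would use a marching-cubes library such as mcubes on the thresholded occupancies):

```python
import torch
import torch.nn as nn

class ImplicitDecoder(nn.Module):
    # Hypothetical occupancy decoder: (image feature, 3D coordinate) -> occupancy logit.
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + 3, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 1),
        )

    def forward(self, feat, points):
        # feat: (B, feat_dim); points: (B, P, 3) query locations.
        f = feat.unsqueeze(1).expand(-1, points.shape[1], -1)
        return self.net(torch.cat([f, points], dim=-1)).squeeze(-1)  # (B, P)

# Build the 32x32x32 query grid over (-1, 1)^3; occupancies reshaped to
# (32, 32, 32) can then be meshed with marching cubes.
g = torch.linspace(-1, 1, 32)
grid = torch.stack(torch.meshgrid(g, g, g, indexing="ij"), dim=-1).reshape(1, -1, 3)
```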

Input RGB (placeholder)
Input RGB.
Pred mesh render (placeholder)
Predicted mesh render (from the implicit decoder).
GT mesh render (placeholder)
Ground-truth mesh render.
Input RGB (placeholder)
Input RGB.
Pred mesh render (placeholder)
Predicted mesh render (from the implicit decoder).
GT mesh render (placeholder)
Ground-truth mesh render.
Input RGB (placeholder)
Input RGB.
Pred mesh render (placeholder)
Predicted mesh render (from the implicit decoder).
GT mesh render (placeholder)
Ground-truth mesh render.