Assignment 2: Single View to 3D

Course: 16-825 (Learning for 3D Vision)
Assignment: 2
Student: Kunwoo Lee (kunwool@andrew.cmu.edu)

1. Exploring Loss Functions

1.1 Voxel BCE (5 pt)

Implement: voxel_loss in losses.py. Demo: python fit_data.py --type 'vox'.
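For reference, a minimal sketch of what voxel_loss could look like, assuming voxel_src holds raw occupancy logits and voxel_tgt binary ground-truth occupancies (the shapes and dtype handling are my assumptions, not the handout's exact spec):

```python
import torch
import torch.nn.functional as F

def voxel_loss(voxel_src: torch.Tensor, voxel_tgt: torch.Tensor) -> torch.Tensor:
    # Binary cross-entropy between predicted occupancy logits and binary targets.
    # The logits variant folds in the sigmoid and is numerically stabler.
    return F.binary_cross_entropy_with_logits(voxel_src, voxel_tgt.float())
```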

GT voxel grid (placeholder)
Ground-truth voxel grid.
Fitted voxel grid (placeholder)
Optimized voxel grid over iterations.

1.2 Point Cloud Chamfer (5 pt)

Implement: chamfer_loss in losses.py (no high-level PyTorch3D loss; knn_points/knn_gather allowed). Demo: python fit_data.py --type 'point'.
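A minimal sketch using knn_points, assuming (B, N, 3) source and (B, M, 3) target point clouds (argument names are assumptions; note that knn_points returns squared distances in its .dists field):

```python
import torch
from pytorch3d.ops import knn_points

def chamfer_loss(point_cloud_src: torch.Tensor, point_cloud_tgt: torch.Tensor) -> torch.Tensor:
    # Squared distance from each source point to its nearest target point, and vice versa.
    d_src = knn_points(point_cloud_src, point_cloud_tgt, K=1).dists  # (B, N, 1)
    d_tgt = knn_points(point_cloud_tgt, point_cloud_src, K=1).dists  # (B, M, 1)
    # Symmetric Chamfer distance: mean over both directions.
    return d_src.mean() + d_tgt.mean()
```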

GT point cloud (placeholder)
Ground-truth point cloud.
Fitted point cloud (placeholder)
Optimized point cloud over iterations.

1.3 Mesh Smoothness (5 pt)

Implement: smoothness_loss in losses.py. Demo: python fit_data.py --type 'mesh'.
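One common choice is Laplacian smoothing. Assuming the handout permits PyTorch3D's built-in here (unlike the Chamfer case, no restriction is stated), the sketch is one line:

```python
from pytorch3d.loss import mesh_laplacian_smoothing

def smoothness_loss(mesh_src):
    # Uniform Laplacian smoothing: penalizes vertices that deviate from the
    # mean of their 1-ring neighbors, discouraging spiky geometry.
    return mesh_laplacian_smoothing(mesh_src, method="uniform")
```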

GT mesh (placeholder)
Ground-truth mesh.
Fitted mesh (placeholder)
Optimized mesh over iterations.

2. Reconstructing 3D from Single View

Train decoders for voxels, points, and meshes via train_model.py. Provide qualitative examples and quantitative evaluations per instructions.

2.1 Image β†’ Voxel Grid (20 pt)

Implement: voxel decoder in model.py. Train: python train_model.py --type 'vox'. Eval: python eval_model.py --type 'vox' --load_checkpoint.
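One possible decoder shape, assuming the encoder produces a 512-d feature vector and the output is a 32^3 grid of logits (layer sizes and names are illustrative, not the graded architecture):

```python
import torch.nn as nn

class VoxelDecoder(nn.Module):
    # Hypothetical decoder: 512-d image feature -> 32^3 occupancy logits.
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 128 * 4 * 4 * 4)
        self.net = nn.Sequential(
            nn.ConvTranspose3d(128, 64, kernel_size=4, stride=2, padding=1),  # 4 -> 8
            nn.ReLU(),
            nn.ConvTranspose3d(64, 32, kernel_size=4, stride=2, padding=1),   # 8 -> 16
            nn.ReLU(),
            nn.ConvTranspose3d(32, 1, kernel_size=4, stride=2, padding=1),    # 16 -> 32
        )

    def forward(self, feat):
        x = self.fc(feat).view(-1, 128, 4, 4, 4)
        return self.net(x).squeeze(1)  # (B, 32, 32, 32) logits, paired with voxel_loss
```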

Input RGB (placeholder)
Input RGB.
Pred voxel render (placeholder)
Predicted voxel render.
GT mesh render (placeholder)
Ground-truth mesh render.
Input RGB (placeholder)
Input RGB.
Pred voxel render (placeholder)
Predicted voxel render.
GT mesh render (placeholder)
Ground-truth mesh render.
Input RGB (placeholder)
Input RGB.
Pred voxel render (placeholder)
Predicted voxel render.
GT mesh render (placeholder)
Ground-truth mesh render.

2.2 Image β†’ Point Cloud (20 pt)

Implement: point-cloud decoder in model.py. Train: python train_model.py --type 'point'. Eval: python eval_model.py --type 'point' --load_checkpoint.
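A simple fully connected decoder works as a baseline, assuming a 512-d feature and n_points outputs in the normalized (-1, 1) cube (all sizes here are assumptions):

```python
import torch.nn as nn

class PointDecoder(nn.Module):
    # Hypothetical decoder: 512-d image feature -> (n_points, 3) coordinates.
    def __init__(self, feat_dim: int = 512, n_points: int = 1000):
        super().__init__()
        self.n_points = n_points
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 2048), nn.ReLU(),
            nn.Linear(2048, n_points * 3), nn.Tanh(),  # keep points in (-1, 1)
        )

    def forward(self, feat):
        return self.net(feat).view(-1, self.n_points, 3)
```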

Input RGB (placeholder)
Input RGB.
Pred point cloud (placeholder)
Predicted point cloud.
GT mesh render (placeholder)
Ground-truth mesh render.
Input RGB (placeholder)
Input RGB.
Pred point cloud (placeholder)
Predicted point cloud.
GT mesh render (placeholder)
Ground-truth mesh render.
Input RGB (placeholder)
Input RGB.
Pred point cloud (placeholder)
Predicted point cloud.
GT mesh render (placeholder)
Ground-truth mesh render.

2.3 Image β†’ Mesh (20 pt)

Implement: mesh decoder in model.py (try different initializations beyond ico_sphere). Train: python train_model.py --type 'mesh'. Eval: python eval_model.py --type 'mesh' --load_checkpoint.
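A common baseline predicts per-vertex offsets on a template; a sketch assuming a 512-d feature and an ico_sphere template (swapping in another initialization changes only the constructor):

```python
import torch.nn as nn
from pytorch3d.utils import ico_sphere

class MeshDecoder(nn.Module):
    # Hypothetical decoder: per-vertex offsets deform a template ico_sphere.
    def __init__(self, feat_dim: int = 512, level: int = 4, device: str = "cuda"):
        super().__init__()
        self.template = ico_sphere(level, device)
        n_verts = self.template.verts_packed().shape[0]
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, n_verts * 3), nn.Tanh(),  # bounded offsets
        )

    def forward(self, feat):
        batch = self.template.extend(feat.shape[0])  # one template per image
        offsets = self.net(feat).reshape(-1, 3)      # packed (B * n_verts, 3)
        return batch.offset_verts(offsets)
```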

Input RGB (placeholder)
Input RGB.
Pred mesh render (placeholder)
Predicted mesh render.
GT mesh render (placeholder)
Ground-truth mesh render.
Input RGB (placeholder)
Input RGB.
Pred mesh render (placeholder)
Predicted mesh render.
GT mesh render (placeholder)
Ground-truth mesh render.
Input RGB (placeholder)
Input RGB.
Pred mesh render (placeholder)
Predicted mesh render.
GT mesh render (placeholder)
Ground-truth mesh render.

2.4 Quantitative Comparisons (10 pt)

Plot F1-score vs. threshold for the voxel, point-cloud, and mesh networks (saved as eval_{type}.png). Include a short discussion.
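For context, the F1 at a given threshold is computed from nearest-neighbor distances between sampled surface points; a sketch of that computation (the exact thresholding convention is an assumption and may differ from the handout's eval code):

```python
import torch
from pytorch3d.ops import knn_points

def f1_at_threshold(points_pred, points_gt, threshold: float):
    # Precision: fraction of predicted points within `threshold` of the GT;
    # recall: fraction of GT points within `threshold` of the prediction.
    d_pred = knn_points(points_pred, points_gt, K=1).dists.sqrt()
    d_gt = knn_points(points_gt, points_pred, K=1).dists.sqrt()
    precision = (d_pred < threshold).float().mean()
    recall = (d_gt < threshold).float().mean()
    return 2 * precision * recall / (precision + recall + 1e-8)
```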

F1 curve voxel
F1 vs threshold (voxel).
F1 curve point
F1 vs threshold (point cloud).
F1 curve mesh
F1 vs threshold (mesh).

All three models show a monotonic increase in F1-score as the distance threshold increases, which is expected since looser thresholds count more near-matches as correct. The voxel-based model performs reasonably but is limited by its coarse spatial resolution. The point-cloud network achieves the highest overall accuracy, likely due to its direct supervision on surface points. The mesh network follows a similar trend but slightly lags behind, reflecting the added difficulty of maintaining geometric regularity while fitting full surface connectivity.

2.5 Hyperparameter Study (10 pt)

Vary one hyperparameter (e.g., n_points, vox_size, w_chamfer, mesh initialization) and analyze its effect. Add plots/tables and a brief conclusion.

F1 curve point
F1 vs threshold (point cloud, n_points = 1000).
F1 curve point
F1 vs threshold (point cloud, n_points = 2000).

The plots compare model performance when varying the number of predicted points (n_points = 1000 vs. n_points = 2000). Increasing the number of predicted points from 1000 to 2000 leads to a modest but consistent improvement in F1 across all thresholds. This suggests that denser point sampling helps the network capture more surface detail and reduces sparsity in the reconstruction, particularly for fine geometric parts (e.g., chair legs or backrests).

2.6 Model Interpretation (15 pt)

Create visualizations that explain what the model learns (e.g., saliency on image features, attention maps, error heatmaps on the 3D output). Add short insights.

Error heatmap (placeholder)
Error heatmap on the predicted point cloud (example 1).
Error heatmap (placeholder)
Error heatmap on the predicted point cloud (example 2).
Error heatmap (placeholder)
Error heatmap on the predicted point cloud (example 3).

To better understand what the model learned, I visualized per-point reconstruction error (color-coded from blue = low error → red = high error) for the point-cloud model. The predicted point set was compared against the ground-truth mesh using nearest-neighbor distances. The resulting error distribution reveals a characteristic failure mode: the point-cloud decoder reconstructs global geometry reasonably well, but red regions appear on the backrest and seat underside, areas with heavy occlusion in the input view. This suggests the network relies strongly on visible surface cues and under-represents occluded structures.
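A minimal sketch of the heatmap computation described above (the color ramp and normalization are my own choices, not a fixed spec):

```python
import torch
from pytorch3d.ops import knn_points

def error_colormap(points_pred: torch.Tensor, points_gt: torch.Tensor) -> torch.Tensor:
    # Per-point distance to the nearest GT surface sample, normalized to [0, 1].
    d = knn_points(points_pred, points_gt, K=1).dists.sqrt().squeeze(-1)  # (B, N)
    t = (d - d.min()) / (d.max() - d.min() + 1e-8)
    # Linear blue (low error) -> red (high error) RGB ramp.
    return torch.stack([t, torch.zeros_like(t), 1.0 - t], dim=-1)  # (B, N, 3)
```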

3. Exploring Other Architectures / Datasets

Choose at least one subsection. Completing multiple earns extra credit.

3.1 Implicit Network (10 pt)

Implement an implicit decoder that maps (image features, 3D location) → occupancy. Predict occupancies over a 32×32×32 grid in (−1, 1)³ and visualize the resulting mesh via marching cubes.
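A sketch of the occupancy decoder and the grid query, assuming a 512-d image feature (layer sizes are illustrative; mesh extraction would use a marching-cubes library such as mcubes on the thresholded occupancies):

```python
import torch
import torch.nn as nn

class ImplicitDecoder(nn.Module):
    # Hypothetical occupancy decoder: (image feature, 3D coordinate) -> occupancy logit.
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + 3, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 1),
        )

    def forward(self, feat, points):
        # feat: (B, feat_dim); points: (B, P, 3) query locations.
        f = feat.unsqueeze(1).expand(-1, points.shape[1], -1)
        return self.net(torch.cat([f, points], dim=-1)).squeeze(-1)  # (B, P)

# Build the 32x32x32 query grid over (-1, 1)^3; occupancies reshaped to
# (32, 32, 32) can then be meshed with marching cubes.
g = torch.linspace(-1, 1, 32)
grid = torch.stack(torch.meshgrid(g, g, g, indexing="ij"), dim=-1).reshape(1, -1, 3)
```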

Input RGB (placeholder)
Input RGB.
Pred mesh render (placeholder)
Predicted mesh render (from the implicit decoder).
GT mesh render (placeholder)
Ground-truth mesh render.
Input RGB (placeholder)
Input RGB.
Pred mesh render (placeholder)
Predicted mesh render (from the implicit decoder).
GT mesh render (placeholder)
Ground-truth mesh render.
Input RGB (placeholder)
Input RGB.
Pred mesh render (placeholder)
Predicted mesh render (from the implicit decoder).
GT mesh render (placeholder)
Ground-truth mesh render.