Single-View to 3D Reconstruction

CMU 16-825 · Vaishnavi Khindkar (vkhindka@andrew.cmu.edu) · Assignment 2 · AWS g4dn.xlarge (T4, 16GB)
Dataset: ShapeNet R2N2
Representations: Voxels · Point Cloud · Mesh

1. Fitting a Single Shape (Warm‑up)

We overfit a single object for each representation to verify losses and rendering. Each row shows before training, after training, and target.

1.1 Voxels

[Figure] Voxel fit: before training · after training · target
[Figure] Loss curve for fit_data.py (voxel).

1.2 Point Cloud

[Figure] Point cloud fit: before training · after training · target
[Figure] Loss curve for fit_data.py (point cloud).

1.3 Mesh

[Figure] Mesh fit: before training · after training · target
[Figure] Loss curve for fit_data.py (mesh).

Abstract

We implement single-view 3D reconstruction following the course starter code, train voxel / point / mesh decoders on ShapeNet R2N2, and evaluate with F1 on point samples. We add quality-of-life fixes (robust rendering + GIFs on GPU), ablations (encoder depth, learning rate, sampling), model interpretation (loss dynamics and normals-colored surfaces), and extended-dataset training (chair vs chair+car+plane).

Environment

python train_model.py --type 'point' --device cuda --arch resnet18 \
  --batch_size 16 --max_iter 1000
python eval_model.py --type 'point' --device cuda --load_checkpoint

Rendering Fix

To save GIFs reliably, we render one frame per camera pose, cast the output to uint8, and write the frames with imageio:

imgs = []
for cams in camera_list:  # one camera pose per GIF frame; camera_list is built upstream
    rend = renderer(mesh, cameras=cams, lights=lights).cpu().numpy()[0, ..., :3]
    imgs.append((np.clip(rend, 0, 1) * 255).astype(np.uint8))
imageio.mimsave(output_path, imgs, fps=18)

2. Reconstructing 3D from a Single View

2.1 Image → Voxel Grid

We implement voxel prediction and marching-cubes visualization. Rendering uses PyTorch3D and creates rotating GIFs.
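A minimal sketch of this branch, assuming the decoder emits raw occupancy logits on a 32³ grid and PyMCubes handles the isosurface extraction (function names here are illustrative, not the exact starter-code API):

import mcubes
import torch
import torch.nn.functional as F

def voxel_loss(voxel_logits, voxel_gt):
    # Per-cell binary cross-entropy between predicted occupancy logits and the {0,1} target grid.
    return F.binary_cross_entropy_with_logits(voxel_logits, voxel_gt.float())

def voxels_to_mesh(voxel_logits, isovalue=0.5):
    # Marching cubes on the sigmoid occupancies yields a triangle mesh for the rotating GIFs.
    probs = torch.sigmoid(voxel_logits).squeeze(0).detach().cpu().numpy()
    verts, faces = mcubes.marching_cubes(probs, isovalue)
    return verts, faces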

[Figure] Voxel predictions (3 examples)
[Figure] Voxel ground truth (3 examples)

2.2 Image → Point cloud

We implement the Chamfer loss to supervise the predicted point cloud; a sketch follows.
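A minimal sketch of the loss, assuming (B, N, 3) point tensors and pytorch3d.ops.knn_points for nearest neighbors; treat it as illustrative rather than the exact starter implementation:

from pytorch3d.ops import knn_points

def chamfer_loss(point_cloud_src, point_cloud_tgt):
    # Squared distance from each source point to its nearest target point, and vice versa.
    dists_src = knn_points(point_cloud_src, point_cloud_tgt, K=1).dists  # (B, N, 1)
    dists_tgt = knn_points(point_cloud_tgt, point_cloud_src, K=1).dists  # (B, M, 1)
    return dists_src.mean() + dists_tgt.mean()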

[Figure] Point cloud predictions (3 examples)
[Figure] Point cloud ground truth (3 examples)

2.3 Image → Mesh

[Figure] Mesh predictions (3 examples)
[Figure] Mesh ground truth (3 examples)

2.4 Evaluation

We compute Precision/Recall/F1 vs radius thresholds using k-NN on sampled points. The script saves a plot per representation.
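A sketch of the metric, assuming (1, N, 3) predicted and ground-truth point samples and a radius threshold in world units (the helper name is hypothetical):

from pytorch3d.ops import knn_points

def f_score(points_pred, points_gt, threshold):
    # Precision: fraction of predicted points within `threshold` of some ground-truth point.
    d_pred = knn_points(points_pred, points_gt, K=1).dists.sqrt()
    precision = (d_pred < threshold).float().mean()
    # Recall: fraction of ground-truth points within `threshold` of some predicted point.
    d_gt = knn_points(points_gt, points_pred, K=1).dists.sqrt()
    recall = (d_gt < threshold).float().mean()
    # Harmonic mean; the small epsilon guards against the degenerate zero case.
    return 2 * precision * recall / (precision + recall + 1e-8)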

[Figure] F1 vs. threshold (voxel)
[Figure] F1 vs. threshold (point)
[Figure] F1 vs. threshold (mesh)

2.5 Ablations

Backbone depth (ResNet-18 vs ResNet-34)

Short runs (~1,000 iterations) on the point-cloud representation.

Backbone     F1@0.05   Time
ResNet-18    0.55      1:10
ResNet-34    0.58      1:52

[Figure] Backbone comparison (ResNet-18 vs. ResNet-34)

We investigated the effect of encoder depth on single-view 3D reconstruction using the point-cloud representation. Two variants of the SingleViewto3D network, with ResNet-18 and ResNet-34 encoders, were trained for 1,000 iterations on ShapeNet R2N2.

Observation: the deeper ResNet-34 backbone yields a +3-point improvement in F1@0.05 over ResNet-18, indicating slightly better geometric fidelity, but runtime increases by roughly 1.6×, suggesting diminishing returns for the extra compute. Qualitative reconstructions (GIFs) show denser, smoother chair structures for ResNet-34, whereas ResNet-18 outputs remain somewhat sparse.

Conclusion: a modest accuracy gain is achievable with a deeper encoder, but for fast experimentation or resource-constrained settings, ResNet-18 offers the better speed-accuracy trade-off.
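For reference, the two runs differ only in the encoder flag; assuming --arch also accepts resnet34, the commands mirror the baseline from the Environment section:

python train_model.py --type 'point' --device cuda --arch resnet18 \
  --batch_size 16 --max_iter 1000
python train_model.py --type 'point' --device cuda --arch resnet34 \
  --batch_size 16 --max_iter 1000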

Learning rate sweep

LR      F1@0.05   Time
2e-4    63.59     1:52
4e-4    54.30     1:51
8e-4    14.12     1:51

[Figure] LR sensitivity

We investigated how the learning rate affects training stability and reconstruction quality for the point-cloud representation, using a ResNet-18 backbone over 1,000 iterations.

Observations: the best performance is achieved at LR = 2e-4, suggesting that a slightly smaller LR gives smoother convergence in the early phase. LR = 4e-4 (the baseline) is stable but yields a slightly lower F1. LR = 8e-4 leads to unstable training and significantly degraded performance (possible divergence). Training time is roughly constant across LRs (~1:50), so the F1 differences reflect optimization quality, not compute.

Conclusion: the model shows moderate sensitivity to learning rate. A smaller LR (2e-4) yields the best early-stage reconstruction quality, while overly aggressive updates (8e-4) harm performance. For stable training on point clouds, an LR in the range [2e-4, 4e-4] is recommended.

2.6 Model Interpretation

(a) Mesh training dynamics

[Figure] Mesh training loss curves: Chamfer drops quickly then plateaus; smoothness stabilizes surfaces.

Interpretation: the Chamfer loss drops rapidly in the first few hundred iterations as the model learns coarse shape alignment, then plateaus as the geometry converges. The smoothness loss oscillates more, because it responds to local curvature updates, but it stabilizes gradually as the surface becomes smoother. The joint trend shows the network balancing accuracy (Chamfer) against regularity (smoothness).

Takeaway: early optimization is dominated by Chamfer alignment; the smoothness term prevents overfitting to noisy vertices and improves surface consistency.
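For reference, a sketch of the combined objective behind these curves, assuming the predicted and target meshes are compared on sampled surface points; w_smooth is an illustrative weight, and chamfer_loss reuses the sketch from Section 2.2:

from pytorch3d.loss import mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes

def mesh_loss(pred_mesh, gt_mesh, w_smooth=0.1, n_points=5000):
    # Chamfer term on points sampled from both surfaces drives coarse shape alignment.
    pts_pred = sample_points_from_meshes(pred_mesh, n_points)
    pts_gt = sample_points_from_meshes(gt_mesh, n_points)
    loss_chamfer = chamfer_loss(pts_pred, pts_gt)
    # Uniform Laplacian smoothing penalizes noisy, high-curvature vertices.
    loss_smooth = mesh_laplacian_smoothing(pred_mesh, method="uniform")
    return loss_chamfer + w_smooth * loss_smooth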

(b) Normals-colored surfaces

[Figure] Normals-colored surfaces across training steps

To inspect surface quality, we colorize vertices by their surface normals (RGB ∈ [0, 1]): smooth color gradients indicate coherent normals, while noisy patches indicate unstable geometry. At early steps (e.g., 250), fragmented colors indicate noisy, unstable normals. By step 500, larger coherent color bands emerge across the seat and back surfaces, evidence of smoother local geometry. Ground-truth meshes show clean, continuous gradients, representing ideal surface normals.

Conclusion: normals visualizations reveal local geometric consistency that loss curves alone cannot show. The smoothness regularizer improves normal coherence, and normals-colored renders serve as qualitative evidence of surface refinement during training.
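A minimal sketch of the coloring, assuming a PyTorch3D Meshes object: each unit vertex normal in [-1, 1] is remapped to an RGB color in [0, 1] and attached as a vertex texture.

from pytorch3d.renderer import TexturesVertex

def color_by_normals(mesh):
    # Per-vertex unit normals lie in [-1, 1]; remap them to RGB colors in [0, 1].
    normals = mesh.verts_normals_padded()  # (B, V, 3)
    mesh.textures = TexturesVertex(verts_features=(normals + 1.0) / 2.0)
    return mesh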

3.3 Extended Dataset: Chair vs Chair+Car+Plane

We switch the split file to split_3c.json and retrain. Below are the plots on the chair test set.

[Figure] Single-class (chair) vs. three classes: F1 vs. threshold

Multi-class training improves generalization on chairs (stronger priors; more diverse geometry).

Setup: we trained the voxel reconstruction model twice:
- Single-class training: chairs only (6,780 samples)
- Three-class training: chair (6,780), car (3,680), and airplane (4,050), totaling 14,510 samples

All runs used the same network architecture, optimizer, and training schedule for a fair comparison. Quantitative results are summarized in the threshold sweep below.


Threshold sweep: F1 scores across evaluation thresholds:

Threshold     0.01   0.02   0.03   0.04   0.05
F1 (single)    7.9   25.9   43.2   57.6   61.0
F1 (multi)     8.0   26.4   44.0   57.7   68.0

Qualitative observations: three-class training led to better shape consistency and clearer geometry on unseen chair examples; the network likely benefits from priors shared across classes (e.g., symmetry, planar parts). The single-class model showed occasional artifacts or missing volumes, consistent with overfitting to a single category.

Analysis: Generalization: the multi-class model learns richer representations, since exposure to diverse shapes improves the latent features. Category confusion: there is a slight risk when classes overlap (e.g., chairs vs. cars), but the voxel task handled the separation well. F1 improvement: the +7-point gain at the 0.05 threshold suggests positive transfer from the other classes.

Conclusion: training on multiple classes yields more robust and generalizable 3D reconstructions, with higher F1 scores and visually consistent outputs. Shared structural features across categories improve voxel prediction quality without degrading per-class performance.

Full dataset (3 classes): prediction and ground truth

[Figure] Full-dataset sample: ground truth
[Figure] Full-dataset sample: prediction

Chairs-only dataset (1 class): prediction and ground truth

[Figure] Chairs-only sample: ground truth
[Figure] Chairs-only sample: prediction

Appendix