Single-View to 3D Reconstruction

CMU 16-825 · Vaishnavi Khindkar (vkhindka@andrew.cmu.edu) · Assignment 2 · AWS g4dn.xlarge (T4, 16GB)
Dataset: ShapeNet R2N2
Representations: Voxels · Point Cloud · Mesh

1. Fitting a Single Shape (Warm‑up)

We overfit a single object for each representation to verify losses and rendering. Each row shows before training, after training, and target.

1.1 Voxels

[Figure] Voxel fit: before training · after training · target
[Figure] Loss curve for fit_data.py (voxel).

1.2 Point Cloud

[Figure] Point cloud fit: before training · after training · target
[Figure] Loss curve for fit_data.py (point cloud).

1.3 Mesh

[Figure] Mesh fit: before training · after training · target
[Figure] Loss curve for fit_data.py (mesh).

Abstract

We implement single-view 3D reconstruction following the course starter code, train voxel / point / mesh decoders on ShapeNet R2N2, and evaluate with F1 on point samples. We add quality-of-life fixes (robust rendering + GIFs on GPU), ablations (encoder depth, learning rate, sampling), model interpretation (loss dynamics and normals-colored surfaces), and extended-dataset training (chair vs chair+car+plane).

Environment

python train_model.py --type 'point' --device cuda --arch resnet18 \
  --batch_size 16 --max_iter 1000
python eval_model.py --type 'point' --device cuda --load_checkpoint

Rendering Fix

To save GIFs reliably, we render one frame per camera pose, cast the output to uint8, and write the frames with imageio:

imgs = []
for cams in camera_list:  # one camera pose per GIF frame; camera_list is built upstream
    rend = renderer(mesh, cameras=cams, lights=lights).cpu().numpy()[0, ..., :3]
    imgs.append((np.clip(rend, 0, 1) * 255).astype(np.uint8))
imageio.mimsave(output_path, imgs, fps=18)

2. Reconstructing 3D from a Single View

2.1 Image → Voxel Grid

We implement voxel prediction and marching-cubes visualization. Rendering uses PyTorch3D and creates rotating GIFs.
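A minimal sketch of this branch, assuming the decoder emits raw occupancy logits on a 32³ grid and PyMCubes handles the isosurface extraction (function names here are illustrative, not the exact starter-code API):

import mcubes
import torch
import torch.nn.functional as F

def voxel_loss(voxel_logits, voxel_gt):
    # Per-cell binary cross-entropy between predicted occupancy logits and the {0,1} target grid.
    return F.binary_cross_entropy_with_logits(voxel_logits, voxel_gt.float())

def voxels_to_mesh(voxel_logits, isovalue=0.5):
    # Marching cubes on the sigmoid occupancies yields a triangle mesh for the rotating GIFs.
    probs = torch.sigmoid(voxel_logits).squeeze(0).detach().cpu().numpy()
    verts, faces = mcubes.marching_cubes(probs, isovalue)
    return verts, faces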

[Figure] Voxel predictions (3 examples)
[Figure] Voxel ground truth (3 examples)

2.2 Image → Point cloud

We implement the Chamfer loss to supervise the predicted point cloud; a sketch follows.
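A minimal sketch of the loss, assuming (B, N, 3) point tensors and pytorch3d.ops.knn_points for nearest neighbors; treat it as illustrative rather than the exact starter implementation:

from pytorch3d.ops import knn_points

def chamfer_loss(point_cloud_src, point_cloud_tgt):
    # Squared distance from each source point to its nearest target point, and vice versa.
    dists_src = knn_points(point_cloud_src, point_cloud_tgt, K=1).dists  # (B, N, 1)
    dists_tgt = knn_points(point_cloud_tgt, point_cloud_src, K=1).dists  # (B, M, 1)
    return dists_src.mean() + dists_tgt.mean()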

[Figure] Point cloud predictions (3 examples)
[Figure] Point cloud ground truth (3 examples)

2.3 Image → Mesh

[Figure] Mesh predictions (3 examples)
[Figure] Mesh ground truth (3 examples)

2.4 Evaluation

We compute Precision/Recall/F1 vs radius thresholds using k-NN on sampled points. The script saves a plot per representation.
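A sketch of the metric, assuming (1, N, 3) predicted and ground-truth point samples and a radius threshold in world units (the helper name is hypothetical):

from pytorch3d.ops import knn_points

def f_score(points_pred, points_gt, threshold):
    # Precision: fraction of predicted points within `threshold` of some ground-truth point.
    d_pred = knn_points(points_pred, points_gt, K=1).dists.sqrt()
    precision = (d_pred < threshold).float().mean()
    # Recall: fraction of ground-truth points within `threshold` of some predicted point.
    d_gt = knn_points(points_gt, points_pred, K=1).dists.sqrt()
    recall = (d_gt < threshold).float().mean()
    # Harmonic mean; the small epsilon guards against the degenerate zero case.
    return 2 * precision * recall / (precision + recall + 1e-8)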

[Figure] F1 vs. threshold (voxel)
[Figure] F1 vs. threshold (point)
[Figure] F1 vs. threshold (mesh)

2.5 Ablations

Backbone depth (ResNet-18 vs ResNet-34)

Short runs (~1,000 iterations) on the point-cloud representation.

Backbone     F1@0.05   Time
ResNet-18    0.55      1:10
ResNet-34    0.58      1:52

[Figure] Backbone comparison (ResNet-18 vs. ResNet-34)

We investigated the effect of encoder depth on single-view 3D reconstruction using the point-cloud representation. Two variants of the SingleViewto3D network, with ResNet-18 and ResNet-34 encoders, were trained for 1,000 iterations on ShapeNet R2N2.

Observation: the deeper ResNet-34 backbone yields a +3-point improvement in F1@0.05 over ResNet-18, indicating slightly better geometric fidelity, but runtime increases by roughly 1.6×, suggesting diminishing returns for the extra compute. Qualitative reconstructions (GIFs) show denser, smoother chair structures for ResNet-34, whereas ResNet-18 outputs remain somewhat sparse.

Conclusion: a modest accuracy gain is achievable with a deeper encoder, but for fast experimentation or resource-constrained settings, ResNet-18 offers the better speed-accuracy trade-off.
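For reference, the two runs differ only in the encoder flag; assuming --arch also accepts resnet34, the commands mirror the baseline from the Environment section:

python train_model.py --type 'point' --device cuda --arch resnet18 \
  --batch_size 16 --max_iter 1000
python train_model.py --type 'point' --device cuda --arch resnet34 \
  --batch_size 16 --max_iter 1000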

Learning rate sweep

LR      F1@0.05   Time
2e-4    63.59     1:52
4e-4    54.30     1:51
8e-4    14.12     1:51

[Figure] LR sensitivity

We investigated how the learning rate affects training stability and reconstruction quality for the point-cloud representation, using a ResNet-18 backbone over 1,000 iterations.

Observations: the best performance is achieved at LR = 2e-4, suggesting that a slightly smaller LR gives smoother convergence in the early phase. LR = 4e-4 (the baseline) is stable but yields a slightly lower F1. LR = 8e-4 leads to unstable training and significantly degraded performance (possible divergence). Training time is roughly constant across LRs (~1:50), so the F1 differences reflect optimization quality, not compute.

Conclusion: the model shows moderate sensitivity to learning rate. A smaller LR (2e-4) yields the best early-stage reconstruction quality, while overly aggressive updates (8e-4) harm performance. For stable training on point clouds, an LR in the range [2e-4, 4e-4] is recommended.

2.6 Model Interpretation

(a) Mesh training dynamics

[Figure] Mesh training loss curves: Chamfer drops quickly then plateaus; smoothness stabilizes surfaces.

Interpretation: the Chamfer loss drops rapidly in the first few hundred iterations as the model learns coarse shape alignment, then plateaus as the geometry converges. The smoothness loss oscillates more, because it responds to local curvature updates, but it stabilizes gradually as the surface becomes smoother. The joint trend shows the network balancing accuracy (Chamfer) against regularity (smoothness).

Takeaway: early optimization is dominated by Chamfer alignment; the smoothness term prevents overfitting to noisy vertices and improves surface consistency.
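For reference, a sketch of the combined objective behind these curves, assuming the predicted and target meshes are compared on sampled surface points; w_smooth is an illustrative weight, and chamfer_loss reuses the sketch from Section 2.2:

from pytorch3d.loss import mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes

def mesh_loss(pred_mesh, gt_mesh, w_smooth=0.1, n_points=5000):
    # Chamfer term on points sampled from both surfaces drives coarse shape alignment.
    pts_pred = sample_points_from_meshes(pred_mesh, n_points)
    pts_gt = sample_points_from_meshes(gt_mesh, n_points)
    loss_chamfer = chamfer_loss(pts_pred, pts_gt)
    # Uniform Laplacian smoothing penalizes noisy, high-curvature vertices.
    loss_smooth = mesh_laplacian_smoothing(pred_mesh, method="uniform")
    return loss_chamfer + w_smooth * loss_smooth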

(b) Normals-colored surfaces

[Figure] Normals-colored surfaces across training steps

To inspect surface quality, we colorize vertices by their surface normals (RGB ∈ [0, 1]): smooth color gradients indicate coherent normals, while noisy patches indicate unstable geometry. At early steps (e.g., 250), fragmented colors indicate noisy, unstable normals. By step 500, larger coherent color bands emerge across the seat and back surfaces, evidence of smoother local geometry. Ground-truth meshes show clean, continuous gradients, representing ideal surface normals.

Conclusion: normals visualizations reveal local geometric consistency that loss curves alone cannot show. The smoothness regularizer improves normal coherence, and normals-colored renders serve as qualitative evidence of surface refinement during training.
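A minimal sketch of the coloring, assuming a PyTorch3D Meshes object: each unit vertex normal in [-1, 1] is remapped to an RGB color in [0, 1] and attached as a vertex texture.

from pytorch3d.renderer import TexturesVertex

def color_by_normals(mesh):
    # Per-vertex unit normals lie in [-1, 1]; remap them to RGB colors in [0, 1].
    normals = mesh.verts_normals_padded()  # (B, V, 3)
    mesh.textures = TexturesVertex(verts_features=(normals + 1.0) / 2.0)
    return mesh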

3.3 Extended Dataset: Chair vs Chair+Car+Plane

We switch the split file to split_3c.json and retrain. Below are the plots on the chair test set.

[Figure] Single-class (chair) vs. three classes: F1 vs. threshold

Multi-class training improves generalization on chairs (stronger priors; more diverse geometry).

Setup: we trained the voxel reconstruction model twice:
- Single-class training: chairs only (6,780 samples)
- Three-class training: chair (6,780), car (3,680), and airplane (4,050), totaling 14,510 samples

All runs used the same network architecture, optimizer, and training schedule for a fair comparison. Quantitative results are summarized in the threshold sweep below.


Threshold sweep: F1 scores across evaluation thresholds:

Threshold     0.01   0.02   0.03   0.04   0.05
F1 (single)    7.9   25.9   43.2   57.6   61.0
F1 (multi)     8.0   26.4   44.0   57.7   68.0

Qualitative observations: three-class training led to better shape consistency and clearer geometry on unseen chair examples; the network likely benefits from priors shared across classes (e.g., symmetry, planar parts). The single-class model showed occasional artifacts or missing volumes, consistent with overfitting to a single category.

Analysis: Generalization: the multi-class model learns richer representations, since exposure to diverse shapes improves the latent features. Category confusion: there is a slight risk when classes overlap (e.g., chairs vs. cars), but the voxel task handled the separation well. F1 improvement: the +7-point gain at the 0.05 threshold suggests positive transfer from the other classes.

Conclusion: training on multiple classes yields more robust and generalizable 3D reconstructions, with higher F1 scores and visually consistent outputs. Shared structural features across categories improve voxel prediction quality without degrading per-class performance.

Full dataset (3 classes): prediction and ground truth

[Figure] Full-dataset sample: ground truth
[Figure] Full-dataset sample: prediction

Chairs-only dataset (1 class): prediction and ground truth

[Figure] Chairs-only sample: ground truth
[Figure] Chairs-only sample: prediction

Appendix