In this section, we implemented and tested three different loss functions for fitting 3D representations: voxel grids, point clouds, and meshes.
Implemented binary cross-entropy loss to fit a 3D binary voxel grid. This loss function is ideal for voxel occupancy prediction as it treats each voxel as an independent binary classification problem.
Voxel grid optimization: Source (red) converging to target (blue)
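The fitting loop can be sketched as follows; the grid size, learning rate, and step count here are illustrative assumptions rather than our exact configuration:

```python
import torch
import torch.nn.functional as F

def voxel_loss(pred_logits, target):
    """Binary cross-entropy over voxel occupancy.

    pred_logits: (B, D, H, W) raw scores; target: (B, D, H, W) in {0, 1}.
    Each voxel is treated as an independent binary classification.
    """
    return F.binary_cross_entropy_with_logits(pred_logits, target)

# Fit a source grid to a random binary target by gradient descent.
torch.manual_seed(0)
target = (torch.rand(1, 32, 32, 32) > 0.5).float()
src = torch.zeros(1, 32, 32, 32, requires_grad=True)
opt = torch.optim.Adam([src], lr=0.1)
for _ in range(200):
    opt.zero_grad()
    loss = voxel_loss(src, target)
    loss.backward()
    opt.step()
```

After optimization, thresholding `torch.sigmoid(src)` at 0.5 recovers the target occupancies.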
Implemented Chamfer loss from scratch to fit 3D point clouds. The Chamfer distance measures the average nearest-neighbor distance between two point sets, providing bidirectional correspondence.
Point cloud optimization: Source (red) converging to target (blue)
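A minimal from-scratch version of the symmetric Chamfer loss can be written with pairwise distances:

```python
import torch

def chamfer_loss(x, y):
    """Symmetric Chamfer distance between point clouds.

    x: (B, N, 3), y: (B, M, 3). For each point, find its nearest
    neighbor in the other cloud and average the squared distances
    in both directions.
    """
    d = torch.cdist(x, y) ** 2          # pairwise squared distances: (B, N, M)
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()
```

The loss is zero for identical clouds and grows as the two sets separate, which makes it a natural objective for gradient-based point cloud fitting.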
Implemented smoothness loss to regularize mesh fitting. This loss penalizes differences in face normals between adjacent faces, encouraging smooth surfaces.
Mesh optimization: Source (red) converging to target (blue) with smoothness regularization
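A sketch of the normal-consistency idea is below; the edge-adjacency construction in plain Python is for clarity only (a library implementation such as PyTorch3D's `mesh_normal_consistency` would be used in practice):

```python
from collections import defaultdict
import torch

def face_normals(verts, faces):
    """Unit normals per triangular face. verts: (V, 3), faces: (F, 3) long."""
    v0, v1, v2 = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
    n = torch.cross(v1 - v0, v2 - v0, dim=1)
    return n / n.norm(dim=1, keepdim=True).clamp(min=1e-8)

def smoothness_loss(verts, faces):
    """Penalize disagreement of normals across shared edges:
    mean of (1 - cos angle) over all adjacent face pairs."""
    edge_faces = defaultdict(list)
    for f_idx, f in enumerate(faces.tolist()):
        for a, b in ((f[0], f[1]), (f[1], f[2]), (f[2], f[0])):
            edge_faces[tuple(sorted((a, b)))].append(f_idx)
    pairs = [fs for fs in edge_faces.values() if len(fs) == 2]
    n = face_normals(verts, faces)
    i = torch.tensor([p[0] for p in pairs])
    j = torch.tensor([p[1] for p in pairs])
    return (1.0 - (n[i] * n[j]).sum(dim=1)).mean()
```

Two coplanar triangles give zero loss; folding one triangle out of the plane makes the loss positive, which is exactly the gradient signal that flattens spiky regions.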
Trained a neural network decoder to predict 32×32×32 binary voxel grids from single RGB images. The decoder uses transposed convolutions to upsample from image features to 3D occupancy predictions.
Input RGB Image
Prediction (red) vs Ground Truth (blue)
Input RGB Image
Prediction (red) vs Ground Truth (blue)
Input RGB Image
Prediction (red) vs Ground Truth (blue)
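The decoder described above can be sketched as follows; the channel widths and the 512-d input feature (e.g. a ResNet18 embedding) are illustrative assumptions:

```python
import torch
import torch.nn as nn

class VoxelDecoder(nn.Module):
    """Map a flat image feature to 32^3 occupancy logits, upsampling
    with transposed 3D convolutions: 4 -> 8 -> 16 -> 32."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 256 * 4 * 4 * 4)
        self.up = nn.Sequential(
            nn.ConvTranspose3d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(64, 1, 4, stride=2, padding=1),  # occupancy logits
        )

    def forward(self, feat):
        x = self.fc(feat).view(-1, 256, 4, 4, 4)
        return self.up(x).squeeze(1)  # (B, 32, 32, 32)
```

Each transposed convolution with kernel 4, stride 2, padding 1 exactly doubles the spatial resolution, so three of them take the seed 4³ volume to the target 32³ grid.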
Trained a decoder to directly predict 3D point coordinates (N×3) from image features. This representation provides higher resolution than voxels without the memory overhead.
Ground Truth
Prediction (red) vs GT (blue)
Ground Truth
Prediction (red) vs GT (blue)
Ground Truth
Prediction (red) vs GT (blue)
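A minimal version of such a decoder is sketched below; the layer sizes, point count, and tanh output normalization (points in a unit cube) are assumptions for illustration:

```python
import torch
import torch.nn as nn

class PointDecoder(nn.Module):
    """Map an image feature vector to N x 3 point coordinates with a
    small MLP."""
    def __init__(self, feat_dim=512, n_points=1000):
        super().__init__()
        self.n_points = n_points
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 2048), nn.ReLU(),
            nn.Linear(2048, n_points * 3),
        )

    def forward(self, feat):
        # tanh keeps predictions in [-1, 1]^3, assuming normalized shapes
        return torch.tanh(self.mlp(feat)).view(-1, self.n_points, 3)
```

Training simply minimizes the Chamfer loss between the predicted cloud and points sampled from the ground-truth surface.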
Trained a mesh deformation network that starts from an icosphere and learns to deform it into the target shape. This approach leverages mesh topology for smooth, continuous surfaces.
Ground Truth
Prediction vs GT
Ground Truth
Prediction vs GT
Ground Truth
Prediction vs GT
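The deformation network can be sketched as predicting per-vertex offsets for a fixed template; the 642-vertex count (a level-3 icosphere) and layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class MeshDeformer(nn.Module):
    """Predict per-vertex offsets for a fixed template mesh (an
    icosphere in our setup). The topology never changes; only the
    vertex positions move."""
    def __init__(self, feat_dim=512, n_verts=642):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, n_verts * 3),
        )

    def forward(self, feat, template_verts):
        # template_verts: (n_verts, 3) icosphere vertices
        offsets = self.mlp(feat).view(-1, template_verts.shape[0], 3)
        return template_verts.unsqueeze(0) + offsets
```

Because faces are inherited from the icosphere, the output is always a watertight mesh, but it is also why thin or multi-part structures are hard to represent.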
We evaluated all three representations using the F1 score at varying distance thresholds. The F1 score is the harmonic mean of precision and recall, where both are computed from nearest-neighbor distances between the predicted and ground-truth point clouds.
Voxel Grid F1 Scores
Point Cloud F1 Scores
Mesh F1 Scores
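The metric can be sketched as follows (unbatched for clarity):

```python
import torch

def f1_at_threshold(pred, gt, tau):
    """F1 between point clouds at distance threshold tau.

    precision: fraction of predicted points within tau of some GT point;
    recall: fraction of GT points within tau of some predicted point.
    pred: (N, 3), gt: (M, 3).
    """
    d = torch.cdist(pred, gt)  # (N, M) pairwise distances
    precision = (d.min(dim=1).values < tau).float().mean()
    recall = (d.min(dim=0).values < tau).float().mean()
    return (2 * precision * recall / (precision + recall + 1e-8)).item()
```

Sweeping `tau` over several thresholds produces the F1 curves shown above; looser thresholds forgive larger geometric errors, so all curves rise monotonically.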
| Model Type | F1@0.05 | Strengths | Weaknesses |
|---|---|---|---|
| Point Cloud | ~70-80% | Highest F1 score, flexible representation, good detail capture | No surface information, discrete points |
| Mesh | ~40-50% | Smooth surfaces, continuous representation, rendering quality | Topology constraints, struggles with thin structures, spiky surfaces |
| Voxel Grid | ~70% | Explicit occupancy, easy to render, GPU-friendly | Limited resolution (32³), memory intensive for higher res |
We analyzed the effect of varying the smoothness weight (w_smooth) in mesh reconstruction, which controls the trade-off between fitting accuracy and surface smoothness.
w_smooth = 20.0 (Moderate)
w_smooth = 200.0 (High)
Alternative experiment: Voxel evaluation with N=3000 sample points
Smoothness Weight (w_smooth) in Mesh Reconstruction:
Key Insight: There's a fundamental trade-off between reconstruction accuracy (low w_smooth) and visual quality (high w_smooth). The optimal value depends on the application: use lower values for accurate geometry measurement, higher values for visual rendering.
Additional Finding: Increasing the number of sampled points (n_points) for voxel evaluation from 1000 to 3000 improves F1 scores by providing denser surface coverage. The returns diminish beyond a certain density, and GPU memory limits how much further the sample count can be pushed.
To better understand what the mesh model learns during training, we implemented loss component visualization that tracks individual loss terms over time.
Loss Components: w_smooth=2.0
Loss Components: w_smooth=20.0
Loss Components: w_smooth=200.0
Loss Component Visualization: By plotting the weighted Chamfer loss and the weighted smoothness loss separately, we can observe which term dominates the optimization at each setting of w_smooth.
Why This Matters: This visualization reveals that the "spiky mesh" problem isn't a bug—it's the model correctly minimizing Chamfer loss without sufficient smoothness regularization. Understanding this trade-off helps us tune hyperparameters more intelligently.
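A minimal version of this tracking is sketched below. The quadratic stand-in losses are hypothetical placeholders so the snippet runs standalone; in the actual pipeline the Chamfer and smoothness terms from mesh fitting take their place:

```python
import torch

torch.manual_seed(0)
w_smooth = 20.0
history = {"chamfer": [], "smooth": [], "total": []}

# Hypothetical stand-in parameters; in the real pipeline these would be
# the deformed mesh vertices.
params = torch.randn(10, 3, requires_grad=True)
opt = torch.optim.Adam([params], lr=0.01)
for step in range(100):
    opt.zero_grad()
    loss_chamfer = (params ** 2).mean()                     # stand-in for the Chamfer term
    loss_smooth = (params[1:] - params[:-1]).pow(2).mean()  # stand-in for the smoothness term
    total = loss_chamfer + w_smooth * loss_smooth
    total.backward()
    opt.step()
    # Record each weighted component separately for plotting.
    history["chamfer"].append(loss_chamfer.item())
    history["smooth"].append(w_smooth * loss_smooth.item())
    history["total"].append(total.item())
```

Plotting the three series against the step index yields the per-component curves shown above; the key point is logging the *weighted* values, so the curves are directly comparable.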
Implemented an implicit occupancy network that takes 3D coordinates and image features as input and predicts occupancy values. This continuous representation can be queried at any resolution.
Input RGB Image
Implicit Network Prediction vs GT
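The network can be sketched as a coordinate MLP conditioned on the image feature; the hidden width and the 512-d feature dimension are assumptions:

```python
import torch
import torch.nn as nn

class ImplicitOccupancy(nn.Module):
    """Occupancy MLP conditioned on an image feature vector.

    Each 3D query point is concatenated with the image feature and
    mapped to an occupancy logit, so the shape can be queried at any
    resolution."""
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, points, feat):
        # points: (B, P, 3) query coordinates, feat: (B, feat_dim)
        f = feat.unsqueeze(1).expand(-1, points.shape[1], -1)
        return self.net(torch.cat([points, f], dim=-1)).squeeze(-1)  # (B, P) logits
```

To extract a discrete shape, the network is evaluated on a dense lattice of query points and the sigmoid of the logits is thresholded at 0.5.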
Performance: Unfortunately, the implicit network performs poorly, producing mostly filled, slanted reconstructions that fail to capture the overall chair shape. The predictions appear as dense, blob-like structures rather than recognizable furniture.
Conclusion: While the implicit representation offers theoretical advantages (continuous querying, resolution-independence), achieving good reconstruction quality requires more sophisticated network architectures and training strategies than implemented here. This demonstrates that architectural choices significantly impact 3D reconstruction performance.
Trained point cloud models on both single-class (chair only) and multi-class (chair, car, plane) datasets to analyze the impact of dataset diversity on reconstruction quality.
Input Image 1
Input Image 2
Input Image 3
Chair-Only Model vs GT
Multi-Class Model vs GT
Chair-Only vs Multi-Class
Chair-Only Model vs GT
Multi-Class Model vs GT
Chair-Only vs Multi-Class
Chair-Only Model vs GT
Multi-Class Model vs GT
Chair-Only vs Multi-Class
F1 Score comparison: Chair-only (blue) vs Multi-class (red) training
Quantitative Results:
| Threshold | Chair-Only F1 | Multi-Class F1 | Difference |
|---|---|---|---|
| 0.01 | 6.0% | 6.3% | +0.3% |
| 0.02 | 27.9% | 28.7% | +0.8% |
| 0.03 | 58.9% | 52.0% | -6.9% |
| 0.04 | 80.9% | 66.7% | -14.2% |
| 0.05 | 90.3% | 76.9% | -13.4% |
Conclusion:
Training on a single class produces specialized, higher-quality reconstructions for that class at the cost of generalization. Training on multiple classes produces a more general model that performs moderately well across categories but sacrifices some accuracy on each individual class.
Model Capacity Hypothesis: The performance gap may also reflect insufficient model capacity. Our network architecture (ResNet18 encoder + simple FC decoder) has limited representational capacity. When trained on a single class, all capacity focuses on learning chair-specific features. When trained on three diverse classes (chairs, cars, planes), the same fixed capacity must be divided among learning features for all categories, leading to a capacity bottleneck where no single class is learned as well. A larger model with more parameters might close this performance gap while maintaining multi-class versatility.
The choice depends on the application: use specialized models for best single-category performance, or use multi-class models (ideally with larger capacity) for versatility across categories.