Definition:
The object is represented as a voxel grid inside a fixed-size cube. Each voxel encodes occupancy (0/1 or probability).
Loss Function:
We use Binary Cross-Entropy (BCE) between the predicted per-voxel occupancy probabilities and the ground-truth voxel occupancies.
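A minimal sketch of this loss in PyTorch, assuming the decoder outputs raw logits over the occupancy grid (shapes and names are illustrative):

```python
import torch.nn.functional as F

def voxel_loss(pred_logits, gt_occupancy):
    # pred_logits: (B, D, H, W) raw decoder outputs (pre-sigmoid).
    # gt_occupancy: (B, D, H, W) binary {0, 1} ground-truth voxels.
    # binary_cross_entropy_with_logits fuses sigmoid + BCE and averages per voxel.
    return F.binary_cross_entropy_with_logits(pred_logits, gt_occupancy.float())
```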

Figure: 3D voxel comparison (Pred vs GT)
Definition:
The object is represented as a set of 3D points capturing its geometry.
Loss Function:
We use the Chamfer Distance, computed as the average nearest-neighbor distance between the predicted and ground-truth point sets, summed over both directions (prediction → GT and GT → prediction).
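A minimal brute-force sketch of this loss (the squared-distance variant is our assumption; efficient implementations typically use nearest-neighbor queries such as pytorch3d.ops.knn_points):

```python
import torch

def chamfer_distance(pred, gt):
    # pred: (B, N, 3) predicted points; gt: (B, M, 3) ground-truth points.
    d = torch.cdist(pred, gt) ** 2           # (B, N, M) pairwise squared distances
    pred_to_gt = d.min(dim=2).values.mean()  # each predicted point -> nearest GT point
    gt_to_pred = d.min(dim=1).values.mean()  # each GT point -> nearest predicted point
    return pred_to_gt + gt_to_pred
```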
Evaluation:
Visualization compares predicted point cloud and ground truth.

Figure: Predicted vs. Ground Truth Point Clouds
Definition:
The object is represented by vertices and faces, forming a mesh surface.
Loss Function:
We use Laplacian Smoothness Loss to penalize large differences between neighboring vertices, encouraging smooth surfaces.
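A minimal sketch of a uniform Laplacian smoothness term, hand-rolled for illustration (pytorch3d.loss.mesh_laplacian_smoothing provides this, including cotangent-weighted variants):

```python
import torch

def laplacian_smoothing(verts, edges):
    # verts: (V, 3) mesh vertices; edges: (E, 2) pairs of vertex indices.
    V, E = verts.shape[0], edges.shape[0]
    neighbor_sum = torch.zeros_like(verts)
    degree = torch.zeros(V, 1, device=verts.device)
    ones = torch.ones(E, 1, device=verts.device)
    # Accumulate each vertex's neighbor positions (both edge directions).
    for a, b in ((edges[:, 0], edges[:, 1]), (edges[:, 1], edges[:, 0])):
        neighbor_sum.index_add_(0, a, verts[b])
        degree.index_add_(0, a, ones)
    # Uniform Laplacian: mean of the neighbors minus the vertex itself.
    lap = neighbor_sum / degree.clamp(min=1) - verts
    return lap.norm(dim=1).mean()
```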
Evaluation:
We render predicted mesh and ground truth mesh side by side.

Figure: Predicted vs. Ground Truth Mesh
Input RGB → Predicted voxel isosurface → Ground-truth voxel.


Input RGB → Predicted point cloud → Ground-truth point cloud.


Input RGB → Predicted mesh → Ground-truth mesh.


We quantitatively compare the F1 score of 3D reconstruction for meshes, point clouds, and voxel grids.
Figure: F1–threshold curves (Voxelgrid, Pointcloud, Mesh).

From the F1–threshold curves, we see that all three methods improve as the threshold increases; however, the three representations differ in accuracy.
Conclusion: Mesh-based reconstruction yields the most accurate geometry, point clouds offer a good trade-off between coverage and detail, and voxel grids are limited by resolution quantization.
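For reference, a minimal sketch of the F1@threshold metric behind these curves, assuming the standard precision/recall formulation over nearest-neighbor distances between sampled point sets:

```python
import torch

def f1_at_threshold(pred, gt, threshold):
    # pred: (N, 3), gt: (M, 3) points sampled from the predicted and GT shapes.
    d = torch.cdist(pred, gt)  # (N, M) pairwise distances
    precision = (d.min(dim=1).values < threshold).float().mean()  # pred points near GT
    recall = (d.min(dim=0).values < threshold).float().mean()     # GT points near pred
    return 2 * precision * recall / (precision + recall + 1e-8)
```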
We study the effect of voxel resolution (vox_size) on 3D reconstruction quality by increasing it from 32 to 48.


- F1-scores do not increase significantly when moving from 32 to 48.
- However, a higher vox_size improves visual detail capture: thin structures (e.g., chair legs, armrests) and sharper edges are reconstructed more faithfully.
- The improvement is mostly qualitative: objects look closer to the ground truth, though the overall overlap metric (F1) does not reflect large changes.
- Increasing vox_size comes with higher memory and compute costs (see the note below).
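For scale: a dense occupancy grid grows cubically with resolution, so the 48³ = 110,592-voxel grid holds about 3.4× as many voxels as the 32³ = 32,768 grid (110,592 / 32,768 ≈ 3.4), with a corresponding increase in memory and decoder output size.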
We compare the baseline 1000-point sampling (np=1000) with an increased resolution of 2000 points (np=2000).


While np=2000 provides more realistic and detailed reconstructions, the quantitative metric (F1-score) does not show a large improvement. This suggests that the metric may not fully reflect perceptual quality in point cloud prediction.
We compare training with the Chamfer loss weight w_chamfer=0.01 against a relaxed setting of w_chamfer=0.05.
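For context on where this weight enters, a minimal sketch of the weighted objective, reusing the chamfer_distance and laplacian_smoothing sketches above (w_smooth is a hypothetical name for the smoothness weight; the exact composition in the training script may differ):

```python
# Hypothetical composition; only w_chamfer appears in the experiments above,
# the other names are illustrative.
loss = (w_chamfer * chamfer_distance(pred_points, gt_points)
        + w_smooth * laplacian_smoothing(pred_verts, pred_edges))
```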


Below we compare vox32 (left) vs vox48 (right) using rotating error visualizations.
The top block colors the GT surface by its distance to the prediction (error_heat_*.gif); the bottom block colors the predicted points by their distance to the GT surface (error_heat_pred_*.gif).
| Example | vox32 | vox48 |
|---|---|---|
| 00 | ![]() | ![]() |
| 01 | ![]() | ![]() |
| 02 | ![]() | ![]() |
| 03 | ![]() | ![]() |
| 04 | ![]() | ![]() |
| 05 | ![]() | ![]() |
Takeaways (qualitative):
We implement a parametric decoder (Parametric2Dto3D) that maps sampled 2D points (UV coordinates in [-1, 1]^2) together with a global image feature vector to 3D coordinates.
At test time we sample N UV points on a canonical 2D domain (e.g., stratified or uniformly at random) and predict their 3D positions to form a point cloud; a minimal sketch is shown below.
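A minimal sketch of such a decoder; the hidden widths, feature dimension, and all names other than Parametric2Dto3D are our assumptions, not the exact implementation:

```python
import torch
import torch.nn as nn

class Parametric2Dto3D(nn.Module):
    # Maps UV samples in [-1, 1]^2, conditioned on a global image feature,
    # to 3D points via a pointwise MLP.
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, uv, feat):
        # uv: (B, N, 2) UV samples; feat: (B, feat_dim) global image feature.
        feat = feat.unsqueeze(1).expand(-1, uv.shape[1], -1)  # broadcast to every UV
        return self.mlp(torch.cat([uv, feat], dim=-1))        # (B, N, 3) points

# Test-time usage: N uniform-random UV samples -> predicted point cloud.
# uv = torch.rand(B, N, 2) * 2 - 1
# points = model(uv, image_feature)
```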
| Triptychs (RGB · Pred · GT) |
|---|
| ![]() |
| ![]() |
| ![]() |
| Average F1 across thresholds |
|---|
| ![]() |
| Example | PNG | GIF |
|---|---|---|
| 0 | ![]() | ![]() |
| 1 | ![]() | ![]() |
| 2 | ![]() | ![]() |
| 3 | ![]() | ![]() |
| 4 | ![]() | ![]() |
| 5 | ![]() | ![]() |
The parametric model recovers the global chair geometry and major surfaces, with low errors concentrated over large regions. Higher errors cluster around thin parts, sharp edges, and concavities, reflecting the difficulty of capturing high-frequency detail from a global feature and a pointwise MLP. Increasing the point count, adding positional encodings on UV (sketched below), or incorporating local image features could reduce these localized errors and improve fine structures.
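As an illustration of the positional-encoding suggestion, a minimal NeRF-style sinusoidal encoding of the UV inputs (num_freqs is an assumed hyperparameter):

```python
import math
import torch

def positional_encoding(uv, num_freqs=6):
    # Sinusoidal encoding of UV in [-1, 1]^2; the pointwise MLP would consume
    # these (..., 2 + 4 * num_freqs) features instead of raw UV.
    feats = [uv]
    for i in range(num_freqs):
        freq = (2.0 ** i) * math.pi
        feats.append(torch.sin(freq * uv))
        feats.append(torch.cos(freq * uv))
    return torch.cat(feats, dim=-1)
```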