Definition:
The object is represented as a voxel grid inside a fixed-size cube. Each voxel encodes occupancy (0/1 or probability).
Loss Function:
We use Binary Cross-Entropy (BCE) between the predicted per-voxel occupancy probabilities and the ground-truth voxel occupancies.
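A minimal sketch of this loss in PyTorch, assuming the decoder outputs raw logits over the occupancy grid (shapes and names are illustrative):

```python
import torch.nn.functional as F

def voxel_loss(pred_logits, gt_occupancy):
    # pred_logits: (B, D, H, W) raw decoder outputs (pre-sigmoid).
    # gt_occupancy: (B, D, H, W) binary {0, 1} ground-truth voxels.
    # binary_cross_entropy_with_logits fuses sigmoid + BCE and averages per voxel.
    return F.binary_cross_entropy_with_logits(pred_logits, gt_occupancy.float())
```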

Figure: 3D voxel comparison (Pred vs GT)
Definition:
The object is represented as a set of 3D points capturing its geometry.
Loss Function:
We use the Chamfer Distance, computed as the average nearest-neighbor distance between the predicted and ground-truth point sets, summed over both directions (prediction → GT and GT → prediction).
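A minimal brute-force sketch of this loss (the squared-distance variant is our assumption; efficient implementations typically use nearest-neighbor queries such as pytorch3d.ops.knn_points):

```python
import torch

def chamfer_distance(pred, gt):
    # pred: (B, N, 3) predicted points; gt: (B, M, 3) ground-truth points.
    d = torch.cdist(pred, gt) ** 2           # (B, N, M) pairwise squared distances
    pred_to_gt = d.min(dim=2).values.mean()  # each predicted point -> nearest GT point
    gt_to_pred = d.min(dim=1).values.mean()  # each GT point -> nearest predicted point
    return pred_to_gt + gt_to_pred
```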
Evaluation:
Visualization compares predicted point cloud and ground truth.

Figure: Predicted vs. Ground Truth Point Clouds
Definition:
The object is represented by vertices and faces, forming a mesh surface.
Loss Function:
We use Laplacian Smoothness Loss to penalize large differences between neighboring vertices, encouraging smooth surfaces.
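A minimal sketch of a uniform Laplacian smoothness term, hand-rolled for illustration (pytorch3d.loss.mesh_laplacian_smoothing provides this, including cotangent-weighted variants):

```python
import torch

def laplacian_smoothing(verts, edges):
    # verts: (V, 3) mesh vertices; edges: (E, 2) pairs of vertex indices.
    V, E = verts.shape[0], edges.shape[0]
    neighbor_sum = torch.zeros_like(verts)
    degree = torch.zeros(V, 1, device=verts.device)
    ones = torch.ones(E, 1, device=verts.device)
    # Accumulate each vertex's neighbor positions (both edge directions).
    for a, b in ((edges[:, 0], edges[:, 1]), (edges[:, 1], edges[:, 0])):
        neighbor_sum.index_add_(0, a, verts[b])
        degree.index_add_(0, a, ones)
    # Uniform Laplacian: mean of the neighbors minus the vertex itself.
    lap = neighbor_sum / degree.clamp(min=1) - verts
    return lap.norm(dim=1).mean()
```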
Evaluation:
We render predicted mesh and ground truth mesh side by side.

Figure: Predicted vs. Ground Truth Mesh
Input RGB → Predicted voxel isosurface → Ground-truth voxel.


Input RGB → Predicted point cloud → Ground-truth point cloud.


Input RGB → Predicted mesh → Ground-truth mesh.


We quantitatively compare the F1 score of 3D reconstruction for meshes, point clouds, and voxel grids.
Figure: F1–threshold curves (Voxelgrid, Pointcloud, Mesh).

From the F1–threshold curves, we see that all three methods improve as the threshold increases; however, the three representations differ in accuracy.
Conclusion: Mesh-based reconstruction yields the most accurate geometry, point clouds offer a good trade-off between coverage and detail, and voxel grids are limited by resolution quantization.
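For reference, a minimal sketch of the F1@threshold metric behind these curves, assuming the standard precision/recall formulation over nearest-neighbor distances between sampled point sets:

```python
import torch

def f1_at_threshold(pred, gt, threshold):
    # pred: (N, 3), gt: (M, 3) points sampled from the predicted and GT shapes.
    d = torch.cdist(pred, gt)  # (N, M) pairwise distances
    precision = (d.min(dim=1).values < threshold).float().mean()  # pred points near GT
    recall = (d.min(dim=0).values < threshold).float().mean()     # GT points near pred
    return 2 * precision * recall / (precision + recall + 1e-8)
```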
We study the effect of voxel resolution (vox_size) on 3D reconstruction quality by increasing it from 32 to 48.


- F1-scores do not increase significantly when moving from 32 to 48.
- However, a higher vox_size improves visual detail capture: thin structures (e.g., chair legs, armrests) and sharper edges are reconstructed more faithfully.
- The improvement is mostly qualitative: objects look closer to the ground truth, though the overall overlap metric (F1) does not reflect large changes.
- Increasing vox_size comes with higher memory and compute costs (see the note below).
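For scale: a dense occupancy grid grows cubically with resolution, so the 48³ = 110,592-voxel grid holds about 3.4× as many voxels as the 32³ = 32,768 grid (110,592 / 32,768 ≈ 3.4), with a corresponding increase in memory and decoder output size.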
We compare the baseline 1000-point sampling (np=1000) with an increased resolution of 2000 points (np=2000).


While np=2000 provides more realistic and detailed reconstructions, the quantitative metric (F1-score) does not show a large improvement. This suggests that the metric may not fully reflect perceptual quality in point cloud prediction.
We compare training with the Chamfer loss weight w_chamfer=0.01 against a relaxed setting of w_chamfer=0.05.
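For context on where this weight enters, a minimal sketch of the weighted objective, reusing the chamfer_distance and laplacian_smoothing sketches above (w_smooth is a hypothetical name for the smoothness weight; the exact composition in the training script may differ):

```python
# Hypothetical composition; only w_chamfer appears in the experiments above,
# the other names are illustrative.
loss = (w_chamfer * chamfer_distance(pred_points, gt_points)
        + w_smooth * laplacian_smoothing(pred_verts, pred_edges))
```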


Below we compare vox32 (left) vs vox48 (right) using rotating error visualizations.
The top block colors the GT surface by its distance to the prediction (error_heat_*.gif); the bottom block colors the predicted points by their distance to the GT surface (error_heat_pred_*.gif).
| Example | vox32 | vox48 |
|---|---|---|
| 00 | ![]() | ![]() |
| 01 | ![]() | ![]() |
| 02 | ![]() | ![]() |
| 03 | ![]() | ![]() |
| 04 | ![]() | ![]() |
| 05 | ![]() | ![]() |
Takeaways (qualitative):
We implement a parametric decoder (Parametric2Dto3D) that maps sampled 2D points (UV coordinates in [-1, 1]^2) together with a global image feature vector to 3D coordinates.
At test time we sample N UV points on a canonical 2D domain (e.g., stratified or uniformly at random) and predict their 3D positions to form a point cloud; a minimal sketch is shown below.
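A minimal sketch of such a decoder; the hidden widths, feature dimension, and all names other than Parametric2Dto3D are our assumptions, not the exact implementation:

```python
import torch
import torch.nn as nn

class Parametric2Dto3D(nn.Module):
    # Maps UV samples in [-1, 1]^2, conditioned on a global image feature,
    # to 3D points via a pointwise MLP.
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, uv, feat):
        # uv: (B, N, 2) UV samples; feat: (B, feat_dim) global image feature.
        feat = feat.unsqueeze(1).expand(-1, uv.shape[1], -1)  # broadcast to every UV
        return self.mlp(torch.cat([uv, feat], dim=-1))        # (B, N, 3) points

# Test-time usage: N uniform-random UV samples -> predicted point cloud.
# uv = torch.rand(B, N, 2) * 2 - 1
# points = model(uv, image_feature)
```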
| Triptychs (RGB · Pred · GT) |
|---|
| ![]() |
| ![]() |
| ![]() |
| Average F1 across thresholds |
|---|
| ![]() |
| Example | PNG | GIF |
|---|---|---|
| 0 | ![]() | ![]() |
| 1 | ![]() | ![]() |
| 2 | ![]() | ![]() |
| 3 | ![]() | ![]() |
| 4 | ![]() | ![]() |
| 5 | ![]() | ![]() |
The parametric model recovers the global chair geometry and major surfaces, with low errors concentrated over large regions. Higher errors cluster around thin parts, sharp edges, and concavities, reflecting the difficulty of capturing high-frequency detail from a global feature and a pointwise MLP. Increasing the point count, adding positional encodings on UV (sketched below), or incorporating local image features could reduce these localized errors and improve fine structures.
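As an illustration of the positional-encoding suggestion, a minimal NeRF-style sinusoidal encoding of the UV inputs (num_freqs is an assumed hyperparameter):

```python
import math
import torch

def positional_encoding(uv, num_freqs=6):
    # Sinusoidal encoding of UV in [-1, 1]^2; the pointwise MLP would consume
    # these (..., 2 + 4 * num_freqs) features instead of raw UV.
    feats = [uv]
    for i in range(num_freqs):
        freq = (2.0 ** i) * math.pi
        feats.append(torch.sin(freq * uv))
        feats.append(torch.cos(freq * uv))
    return torch.cat(feats, dim=-1)
```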