16-825 Assignment 2: Single View to 3D

Goals: In this assignment, you will explore loss functions and decoder architectures for regressing voxel, point cloud, and mesh representations from single-view RGB input.

Table of Contents

  1. Exploring Loss Functions
  2. Reconstructing 3D from Single View
  3. Exploring Other Architectures / Datasets

1. Exploring loss functions

1.1. Fitting a voxel grid (5 points)

1.2. Fitting a point cloud (5 points)

1.3. Fitting a mesh (5 points)

2. Reconstructing 3D from single view

2.1. Image to voxel grid (20 points)

Final F1@0.05 score: 73.88

2.2. Image to point cloud (20 points)

For better coverage, I chose n_points = 2048. Final F1@0.05 score: 76.81

2.3. Image to mesh (20 points)

Final F1@0.05 score: 72.34

2.4. Quantitative comparisons (10 points)

At low thresholds, even small deviations are penalized harshly, lowering both precision and recall. Given similar reconstruction quality, point clouds should achieve slightly higher F1 scores, since they are trained to fit the ground truth points directly. Voxels and meshes require an extra sample_points_from_meshes step during F1 computation, and this random sampling introduces noise that can slightly lower their scores.
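For reference, here is a minimal sketch of how F1@threshold is computed between two point sets, assuming simple nearest-neighbor matching via `torch.cdist` (the course starter code's implementation may differ in details such as distance scaling):

```python
import torch

def f1_at_threshold(pred, gt, threshold=0.05):
    """F1@threshold between point sets pred (N, 3) and gt (M, 3).

    Precision: fraction of predicted points within threshold of some GT point.
    Recall: fraction of GT points within threshold of some predicted point.
    """
    dists = torch.cdist(pred, gt)                                  # (N, M) pairwise distances
    precision = (dists.min(dim=1).values < threshold).float().mean()
    recall = (dists.min(dim=0).values < threshold).float().mean()
    f1 = 2 * precision * recall / (precision + recall + 1e-8)
    return precision.item(), recall.item(), f1.item()
```

This makes the threshold sensitivity explicit: shrinking `threshold` converts borderline matches into misses on both sides of the harmonic mean.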

2.5. Analyse effects of hyperparameter variations (10 points)

During point cloud training, I explored the effect of varying the relative weights between the two terms in the Chamfer loss, which correspond to precision and recall. The modified chamfer loss is defined as:

\[ d_{CD}(S_{\mathrm{pred}}, S_{\mathrm{gt}})= \alpha \sum_{x \in S_{\mathrm{pred}}} \min_{y \in S_{\mathrm{gt}}} \|x-y\|_2^2 +(2-\alpha)\sum_{y \in S_{\mathrm{gt}}} \min_{x \in S_{\mathrm{pred}}} \|x-y\|_2^2 \]

When $\alpha < 1$, recall is prioritized, which ensures all ground truth points have nearby predictions. The resulting point clouds tend to loosely cover the entire shape but often lack fine structural details.

When $\alpha > 1$, precision is prioritized, which encourages predicted points to be closer to actual surface regions. However, as $\alpha$ increases, points begin to cluster densely around high-confidence areas, causing sparse coverage elsewhere and potentially missing thin or distant structures (e.g., chair legs).

To mitigate over-clustering, I added an additional repulsion loss defined as

$$ L_{\mathrm{rep}}= \sum_{i=0}^{\hat N} \sum_{i'\in K(i)} -\lVert x_{i'}-x_i\rVert\, w\!\big(\lVert x_{i'}-x_i\rVert\big) $$ $$ w(r)=\exp\!\left(-\frac{r^{2}}{h^{2}}\right) $$
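A sketch of this repulsion term for a single point cloud, where the neighborhood size `k` and Gaussian bandwidth `h` shown are illustrative choices rather than the exact values used in training:

```python
import torch

def repulsion_loss(points, k=4, h=0.03):
    """Repulsion term that discourages points from clustering.

    points: (N, 3). For each point, penalize its k nearest neighbors
    being too close, with a Gaussian falloff w(r) = exp(-r^2 / h^2)
    so that only neighbors within roughly distance h contribute.
    """
    d = torch.cdist(points, points)                    # (N, N) pairwise distances
    # k+1 smallest distances include the point itself at distance 0; drop it
    knn = d.topk(k + 1, largest=False).values[:, 1:]   # (N, k) neighbor distances
    w = torch.exp(-knn ** 2 / h ** 2)
    return (-knn * w).sum()
```

Note that $-r\,w(r)$ is most negative near $r = h/\sqrt{2}$ and vanishes as $r \to 0$ or $r \to \infty$, so minimizing this term pushes near-coincident neighbors apart while leaving well-separated points untouched.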

In practice, I found that combining the repulsion loss with the Chamfer loss at \(\alpha = 1.0\) produced the best qualitative results.

2.6. Interpret your model (15 points)

We can gain insight into the model’s behavior by visualizing the per-point contribution to the Chamfer distance. Each ground truth point is colored according to its nearest distance to any predicted point. Green indicates accurate reconstruction, while red represents larger deviations.

This error heatmap reveals the model's typical failure modes. Most of the sitting surfaces of chairs are reconstructed precisely, showing dense green regions. In contrast, backrests, thin legs, edges, and other fine structures often appear red, suggesting higher geometric uncertainty or incomplete recovery of fine detail.

Here are some representative examples.

3. Exploring other architectures / datasets

3.1. Implicit network (10 points)

Final F1@0.05 score: 41.32
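For context, an implicit network predicts occupancy at continuous 3D query locations conditioned on the image encoding. The sketch below shows a minimal occupancy-style decoder; the architecture and dimensions are assumptions for illustration, not necessarily those behind the reported score:

```python
import torch
import torch.nn as nn

class ImplicitDecoder(nn.Module):
    """Minimal occupancy-style implicit decoder (illustrative architecture).

    Given a per-image feature vector and query 3D coordinates, predict
    the occupancy probability of each query point.
    """
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feat, xyz):
        # feat: (B, feat_dim) image encoding; xyz: (B, P, 3) query points
        feat = feat.unsqueeze(1).expand(-1, xyz.shape[1], -1)
        logits = self.mlp(torch.cat([feat, xyz], dim=-1)).squeeze(-1)
        return torch.sigmoid(logits)    # (B, P) occupancy in [0, 1]
```

At inference time, occupancies evaluated on a dense grid can be converted to a mesh via marching cubes, which is then sampled for F1 evaluation.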