16-825 Assignment 2: Single View to 3D

Goals: In this assignment, you will explore loss functions and decoder architectures for regressing voxel, point cloud, and mesh representations from single-view RGB input.

Table of Contents

  1. Exploring Loss Functions
  2. Reconstructing 3D from Single View
  3. Exploring Other Architectures / Datasets

1. Exploring loss functions

1.1. Fitting a voxel grid (5 points)

1.2. Fitting a point cloud (5 points)

1.3. Fitting a mesh (5 points)

2. Reconstructing 3D from single view

2.1. Image to voxel grid (20 points)

Final F1@0.05 score: 73.88

2.2. Image to point cloud (20 points)

For better coverage, I chose n_points = 2048. Final F1@0.05 score: 76.81

2.3. Image to mesh (20 points)

Final F1@0.05 score: 72.34

2.4. Quantitative comparisons (10 points)

At low thresholds, even small deviations are penalized harshly, lowering both precision and recall. Given similar reconstruction quality, point clouds should achieve slightly higher F1 scores, since they are trained to fit the ground truth points directly. Voxels and meshes require an extra sample_points_from_meshes step during F1 computation, and this random sampling introduces noise that can slightly lower their scores.
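For reference, here is a minimal sketch of how F1@threshold is computed between two point sets, assuming simple nearest-neighbor matching via `torch.cdist` (the course starter code's implementation may differ in details such as distance scaling):

```python
import torch

def f1_at_threshold(pred, gt, threshold=0.05):
    """F1@threshold between point sets pred (N, 3) and gt (M, 3).

    Precision: fraction of predicted points within threshold of some GT point.
    Recall: fraction of GT points within threshold of some predicted point.
    """
    dists = torch.cdist(pred, gt)                                  # (N, M) pairwise distances
    precision = (dists.min(dim=1).values < threshold).float().mean()
    recall = (dists.min(dim=0).values < threshold).float().mean()
    f1 = 2 * precision * recall / (precision + recall + 1e-8)
    return precision.item(), recall.item(), f1.item()
```

This makes the threshold sensitivity explicit: shrinking `threshold` converts borderline matches into misses on both sides of the harmonic mean.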

2.5. Analyse effects of hyperparameter variations (10 points)

During point cloud training, I explored the effect of varying the relative weights between the two terms in the Chamfer loss, which correspond to precision and recall. The modified chamfer loss is defined as:

\[ d_{CD}(S_{\mathrm{pred}}, S_{\mathrm{gt}})= \alpha \sum_{x \in S_{\mathrm{pred}}} \min_{y \in S_{\mathrm{gt}}} \|x-y\|_2^2 +(2-\alpha)\sum_{y \in S_{\mathrm{gt}}} \min_{x \in S_{\mathrm{pred}}} \|x-y\|_2^2 \]

When $\alpha < 1$, recall is prioritized, which ensures all ground truth points have nearby predictions. The resulting point clouds tend to loosely cover the entire shape but often lack fine structural details.

When $\alpha > 1$, precision is prioritized, which encourages predicted points to be closer to actual surface regions. However, as $\alpha$ increases, points begin to cluster densely around high-confidence areas, causing sparse coverage elsewhere and potentially missing thin or distant structures (e.g., chair legs).

To mitigate over-clustering, I added an additional repulsion loss defined as

$$ L_{\mathrm{rep}}= \sum_{i=0}^{\hat N} \sum_{i'\in K(i)} -\lVert x_{i'}-x_i\rVert\, w\!\big(\lVert x_{i'}-x_i\rVert\big) $$ $$ w(r)=\exp\!\left(-\frac{r^{2}}{h^{2}}\right) $$
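A sketch of this repulsion term for a single point cloud, where the neighborhood size `k` and Gaussian bandwidth `h` shown are illustrative choices rather than the exact values used in training:

```python
import torch

def repulsion_loss(points, k=4, h=0.03):
    """Repulsion term that discourages points from clustering.

    points: (N, 3). For each point, penalize its k nearest neighbors
    being too close, with a Gaussian falloff w(r) = exp(-r^2 / h^2)
    so that only neighbors within roughly distance h contribute.
    """
    d = torch.cdist(points, points)                    # (N, N) pairwise distances
    # k+1 smallest distances include the point itself at distance 0; drop it
    knn = d.topk(k + 1, largest=False).values[:, 1:]   # (N, k) neighbor distances
    w = torch.exp(-knn ** 2 / h ** 2)
    return (-knn * w).sum()
```

Note that $-r\,w(r)$ is most negative near $r = h/\sqrt{2}$ and vanishes as $r \to 0$ or $r \to \infty$, so minimizing this term pushes near-coincident neighbors apart while leaving well-separated points untouched.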

In practice, I found that combining the repulsion loss with the Chamfer loss at \(\alpha = 1.0\) produced the best qualitative results.

2.6. Interpret your model (15 points)

We can gain insight into the model’s behavior by visualizing the per-point contribution to the Chamfer distance. Each ground truth point is colored according to its nearest distance to any predicted point. Green indicates accurate reconstruction, while red represents larger deviations.

This error heatmap reveals the model's typical failure modes. Most of the sitting surfaces of chairs are reconstructed precisely, showing dense green regions. In contrast, backrests, thin legs, edges, and other fine structures often appear red, suggesting higher geometric uncertainty or incomplete recovery of fine detail.

Here are some representative examples.

3. Exploring other architectures / datasets

3.1. Implicit network (10 points)

Final F1@0.05 score: 41.32
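For context, an implicit network predicts occupancy at continuous 3D query locations conditioned on the image encoding. The sketch below shows a minimal occupancy-style decoder; the architecture and dimensions are assumptions for illustration, not necessarily those behind the reported score:

```python
import torch
import torch.nn as nn

class ImplicitDecoder(nn.Module):
    """Minimal occupancy-style implicit decoder (illustrative architecture).

    Given a per-image feature vector and query 3D coordinates, predict
    the occupancy probability of each query point.
    """
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feat, xyz):
        # feat: (B, feat_dim) image encoding; xyz: (B, P, 3) query points
        feat = feat.unsqueeze(1).expand(-1, xyz.shape[1], -1)
        logits = self.mlp(torch.cat([feat, xyz], dim=-1)).squeeze(-1)
        return torch.sigmoid(logits)    # (B, P) occupancy in [0, 1]
```

At inference time, occupancies evaluated on a dense grid can be converted to a mesh via marching cubes, which is then sampled for F1 evaluation.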