Assignment 2: Single View to 3D

Name: Xinyu Liu

1. Exploring loss functions

1.1. Fitting a voxel grid

Left: target; Right: predicted.

GIF GIF
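The voxel-fitting objective is a per-voxel binary cross-entropy between predicted occupancy logits and the ground-truth {0, 1} grid (a PyTorch implementation would typically use `torch.nn.functional.binary_cross_entropy_with_logits`). A minimal NumPy sketch, with illustrative names:

```python
import numpy as np

def voxel_bce_loss(pred_logits, target):
    """Mean binary cross-entropy between occupancy logits and a {0,1} grid."""
    # numerically stable BCE-with-logits: max(x,0) - x*t + log(1 + exp(-|x|))
    x, t = pred_logits, target
    return np.mean(np.maximum(x, 0) - x * t + np.log1p(np.exp(-np.abs(x))))

# toy 2x2x2 grid: confident correct predictions give a small loss
target = np.zeros((2, 2, 2))
target[0, 0, 0] = 1.0
logits = np.where(target > 0, 5.0, -5.0)  # logits strongly favor the right label
loss = voxel_bce_loss(logits, target)
```

Working on logits (rather than sigmoid outputs) keeps the loss numerically stable for very confident predictions.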

1.2. Fitting a point cloud

Left: target; Right: predicted.

GIF GIF
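The point-cloud objective is the chamfer distance: each predicted point is pulled toward its nearest ground-truth point, and vice versa (PyTorch3D's `chamfer_distance` provides a differentiable GPU version). A minimal NumPy sketch, assuming point sets small enough that the full pairwise matrix fits in memory:

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric chamfer distance between (N,3) and (M,3) point sets."""
    # pairwise squared distances, shape (N, M), via broadcasting
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    # mean squared distance to the nearest neighbor, in both directions
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

src = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
tgt = np.array([[0.0, 0.0, 0.1], [1.0, 0.0, 0.0]])
d = chamfer_distance(src, tgt)  # small: the two clouds nearly coincide
```

The dense O(NM) distance matrix is fine for a sketch; real implementations use batched GPU kernels or KD-trees for large clouds.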

1.3. Fitting a mesh

Left: target; Right: predicted.

GIF GIF
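Mesh fitting typically deforms the vertices of an initial mesh under a chamfer loss on sampled surface points plus a smoothness regularizer. A minimal NumPy sketch of a uniform Laplacian smoothing term (PyTorch3D's `mesh_laplacian_smoothing` is the differentiable analogue; the vertex and edge values here are illustrative):

```python
import numpy as np

def laplacian_smoothing(verts, edges):
    """Uniform Laplacian loss: penalizes each vertex's offset from the
    mean of its neighbors, which discourages spiky deformations."""
    neighbor_sum = np.zeros_like(verts)
    degree = np.zeros(len(verts))
    for i, j in edges:  # accumulate neighbors in both directions
        neighbor_sum[i] += verts[j]
        neighbor_sum[j] += verts[i]
        degree[i] += 1
        degree[j] += 1
    delta = verts - neighbor_sum / degree[:, None]
    return np.mean(np.linalg.norm(delta, axis=1))

# a flat square split into two triangles; boundary vertices still
# contribute a nonzero uniform-Laplacian term
verts = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dtype=float)
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
loss = laplacian_smoothing(verts, edges)
```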

2. Reconstructing 3D from single view

2.1. Image to voxel grid

Left: 2D Image; Middle: 3D ground-truth mesh; Right: 3D predicted voxel grid

Image GIF GIF

Image GIF GIF

Image GIF GIF

2.2. Image to point cloud

Left: 2D Image; Middle: 3D ground-truth mesh; Right: 3D predicted point cloud

Image GIF GIF

Image GIF GIF

Image GIF GIF

2.3. Image to mesh

Left: 2D Image; Middle: 3D ground-truth mesh; Right: 3D predicted mesh

Image GIF GIF

Image GIF GIF

Image GIF GIF

2.4. Quantitative comparisons

In my F1-score results, point cloud > voxel > mesh. The likely reasons are:

  1. Theoretically, voxel grids should be the easiest to predict, since each voxel is simply occupied or empty. Point clouds are harder, since they represent surfaces as unordered sets of 3D points. Meshes are the hardest, since they require predicting both vertex positions and explicit face connectivity.
  2. In my case, however, point clouds achieve a slightly higher F1-score than voxel grids. This is likely because I increased n_points from the default 1000 to 3000, allowing the point cloud to capture finer surface details. In contrast, the voxel grid uses the default resolution, which is relatively low, making it hard to represent thin or fine geometry and resulting in more mismatches along boundaries.
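The F1-score used for these comparisons checks, at a distance threshold, what fraction of predicted points lie near the ground truth (precision) and what fraction of ground-truth points are covered by the prediction (recall). A minimal NumPy sketch; the 0.05 threshold is illustrative:

```python
import numpy as np

def f1_score(pred, gt, threshold=0.05):
    """F1 at a distance threshold between (N,3) and (M,3) point sets."""
    d2 = np.sum((pred[:, None, :] - gt[None, :, :]) ** 2, axis=-1)
    precision = np.mean(np.sqrt(d2.min(axis=1)) < threshold)  # pred -> gt
    recall = np.mean(np.sqrt(d2.min(axis=0)) < threshold)     # gt -> pred
    return 2 * precision * recall / max(precision + recall, 1e-8)

pred = np.array([[0.0, 0, 0], [1.0, 0, 0], [5.0, 0, 0]])  # one far outlier
gt = np.array([[0.01, 0, 0], [1.0, 0, 0]])
score = f1_score(pred, gt)  # precision 2/3, recall 1 -> F1 = 0.8
```

Because precision penalizes spurious points and recall penalizes missed surface, F1 balances the failure modes of sparse and noisy predictions.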

2.5. Analyzing the effects of hyperparameter variations

I experimented with n_points = {1000, 3000, 5000, 10000}. The F1-score curves are shown below (1000 → 10000 from left to right). The model with 1000 predicted points has the lowest average F1-score, while the 5000-point version achieves the highest, with the 3000- and 10000-point variants falling in between.

This trend reflects a trade-off between surface coverage and prediction noise. When the number of points is small (1000), the predicted point cloud is too sparse to capture the full geometry of the object. As the number of points increases, the surface coverage improves, allowing for more complete and accurate reconstructions. However, when the number becomes excessively large (10000), the model may begin to produce redundant or slightly off-surface points, which increases noise and slightly reduces precision.

The figure below illustrates the same object reconstructed with different point counts. The 1000-point version shows noticeable gaps and missing areas, while the 3000- and 5000-point versions capture surface details accurately with minimal noise. The 10000-point version shows some prediction noise near the surface. These observations suggest that a moderate point count (around 3000–5000) is generally optimal.

GIF GIF GIF GIF

2.6. Interpret your model

To better understand my model, I start by visualizing failed reconstructions to identify which parts of the output go wrong and whether the failures share common patterns. I focus on single-view to point cloud reconstruction and use error color-coding for the analysis. My visualization pipeline consists of the following steps:

  1. Sample a dense set of points from the ground-truth mesh surface, with per-face probability proportional to face area.
  2. For each predicted point, compute its nearest-neighbor distance to these samples to quantify the local reconstruction error.
  3. Plot the predicted points color-coded by error magnitude, with the ground-truth samples overlaid (color scale: black → red → yellow → white, from larger to smaller errors).
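The error-computation steps above can be sketched as follows (step 1, area-weighted mesh sampling, is assumed to have produced `gt_points` already, e.g. via PyTorch3D's `sample_points_from_meshes`; all names are illustrative):

```python
import numpy as np

def error_colors(pred_points, gt_points):
    """Per-predicted-point nearest-neighbor distance to the ground-truth
    samples, normalized to [0, 1] for colormapping."""
    d2 = np.sum((pred_points[:, None, :] - gt_points[None, :, :]) ** 2, axis=-1)
    err = np.sqrt(d2.min(axis=1))      # step 2: NN distance per predicted point
    return err / max(err.max(), 1e-8)  # step 3: normalize for the color scale

rng = np.random.default_rng(0)
pred = rng.normal(size=(100, 3))   # stand-ins for predicted points
gt = rng.normal(size=(200, 3))     # stand-ins for GT surface samples
c = error_colors(pred, gt)         # feed c to the colormap when plotting
```

The normalized values can then be passed to any sequential colormap (e.g. matplotlib's "hot", which runs black → red → yellow → white) when rendering the interactive scatter plot.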

Here are three object visualizations, each shown from two different angles (front and right). The underlying figure is actually an interactive 3D HTML view that supports rotation, zooming, and point selection; only screenshots are shown here.

From these visualizations, I observe that for large, continuous surfaces, the model reconstructs the 3D geometry smoothly. However, for sparse or detailed regions, such as unexpected holes or thin chair legs, the model sometimes misses these structures or reconstructs them as plain surfaces, ignoring finer details. Additionally, some noise appears near surface boundaries.

3. Exploring other architectures / datasets

3.3. Extended dataset for training

I trained the single-view to point cloud model with 3000 points and evaluated it on the same chair test set. The F1-score results are shown below (left: original dataset; right: extended dataset). The overall F1-scores are similar, but the model trained exclusively on the chair class performs slightly better.

The reconstructions from the two models are visualized below (left: ground truth; middle: original dataset; right: extended dataset). In the first example, the chair has irregular legs. The model trained on a single class (middle) fails to capture these geometric details accurately, while the model trained on three classes (right) generalizes better and produces a closer reconstruction.

GIF GIF GIF

In the second example, the chair has a distinctive structure beneath the seat. The single-class model (middle) produces a more generic chair shape, missing these subtle features. The three-class model (right), however, captures part of this structure but also introduces additional surface noise and ambiguity around other regions, especially thin regions such as the legs.

GIF GIF GIF

In summary, the model trained solely on the chair class produces more consistent and stable reconstructions, whereas the model trained on multiple classes generates more diverse results—some with finer geometric details, but others with increased noise or reduced class-specific accuracy.