Left: target; Right: predicted.

Left: target; Right: predicted.

Left: target; Right: predicted.

Left: 2D Image; Middle: 3D ground-truth mesh; Right: 3D predicted voxel grid



Left: 2D Image; Middle: 3D ground-truth mesh; Right: 3D predicted point cloud



Left: 2D Image; Middle: 3D ground-truth mesh; Right: 3D predicted mesh



For my F1-score results, point cloud > voxel > mesh. The reasons behind this ordering might be:

I experimented with n_points = {1000, 3000, 5000, 10000}. The F1-score graphs are shown below (1000 -> 10000 from left to right). The model with 1000 predicted points has the lowest average F1-score, while the 5000-point version achieves the highest, with the 3000- and 10000-point variants falling in between.

This trend reflects a trade-off between surface coverage and prediction noise. When the number of points is small (1000), the predicted point cloud is too sparse to capture the full geometry of the object. As the number of points increases, the surface coverage improves, allowing for more complete and accurate reconstructions. However, when the number becomes excessively large (10000), the model may begin to produce redundant or slightly off-surface points, which increases noise and slightly reduces precision.
The figure below illustrates the same object reconstructed with different point counts. The 1000-point version shows noticeable gaps and missing areas, while the 3000- and 5000-point versions capture surface details accurately with minimal noise. The 10000-point version shows some prediction noise near the surface. These observations suggest that a moderate point count (around 3000–5000) is generally optimal.
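
The F1-score in these experiments is presumably the standard precision/recall trade-off at a fixed distance threshold between predicted and ground-truth points. Below is a minimal sketch of that metric, assuming PyTorch3D utilities; the function name pointcloud_f1, the 0.05 threshold, and the commented sampling call are illustrative assumptions rather than the exact evaluation code used here.

```python
from pytorch3d.ops import knn_points, sample_points_from_meshes

def pointcloud_f1(pred, gt, threshold=0.05):
    """F1-score between two point clouds at a distance threshold (illustrative sketch).

    pred: (1, N_pred, 3) predicted points; gt: (1, N_gt, 3) ground-truth points.
    """
    # knn_points returns squared distances, so take the square root.
    d_pred_to_gt = knn_points(pred, gt, K=1).dists[..., 0].sqrt()  # pred -> nearest GT point
    d_gt_to_pred = knn_points(gt, pred, K=1).dists[..., 0].sqrt()  # GT -> nearest pred point

    precision = (d_pred_to_gt < threshold).float().mean()  # fraction of predictions near the surface
    recall = (d_gt_to_pred < threshold).float().mean()     # fraction of the surface that is covered
    return (2 * precision * recall / (precision + recall + 1e-8)).item()

# Ground-truth points can be sampled from the mesh (assumed usage):
# gt = sample_points_from_meshes(gt_mesh, num_samples=n_points)
# f1 = pointcloud_f1(pred_points, gt)
```

Under such a metric, too few predicted points mainly hurt recall (sparse coverage), while an excessive number can hurt precision (off-surface noise), which is consistent with the trend described above.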

To better understand my model, I start by visualizing the “failure” reconstruction cases to identify which parts of the reconstruction are going wrong and whether there are common patterns among these failures. I focus on single-view to 3D point cloud reconstruction and use error color-coding for analysis. My visualization pipeline includes the following steps:
Here are three object visualizations, each shown from two different angles (front & right). The output is actually a 3D interactive HTML that allows rotation, zooming, and point selection, but only screenshots are shown here.
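
As a rough sketch of the error color-coding step, each predicted point can be colored by its nearest-neighbor distance to the ground-truth points and exported as an interactive HTML view. The use of scipy and plotly, and the helper name save_error_visualization, are my assumptions about the tooling rather than the exact pipeline used here.

```python
import plotly.graph_objects as go
from scipy.spatial import cKDTree

def save_error_visualization(pred_points, gt_points, out_path="error_vis.html"):
    """Color each predicted point by its distance to the nearest ground-truth
    point and save an interactive 3D view (rotation / zoom / point selection).

    pred_points, gt_points: (N, 3) numpy arrays.
    """
    # Per-point error: nearest-neighbor distance from prediction to ground truth.
    errors, _ = cKDTree(gt_points).query(pred_points, k=1)

    fig = go.Figure(
        go.Scatter3d(
            x=pred_points[:, 0], y=pred_points[:, 1], z=pred_points[:, 2],
            mode="markers",
            marker=dict(size=2, color=errors, colorscale="Viridis",
                        colorbar=dict(title="error")),
        )
    )
    fig.write_html(out_path)  # open in a browser for the interactive view
```

Opening the saved HTML in a browser gives the rotation, zooming, and point-selection behavior described above.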

From these visualizations, I observe that the model reconstructs large, continuous surfaces smoothly. However, for sparse or detailed regions, such as holes or thin chair legs, the model sometimes misses these structures or reconstructs them as plain surfaces, ignoring finer details. Additionally, some noise appears near surface boundaries.
I trained the single-view to point cloud model with 3000 points on the extended dataset and evaluated it on the same chair test set. The F1-score results are shown below (Left: original dataset; Right: extended dataset). The overall F1-scores are similar, but the model trained exclusively on the chair class achieves a slightly higher score.

The reconstructions from the two models are visualized below (Left: ground truth; Middle: original dataset; Right: extended dataset). In the first example, the chair has irregular legs. The model trained on a single class (middle) fails to accurately capture these geometric details, while the model trained on three classes (right) generalizes better and produces a more faithful reconstruction.

In the second example, the chair has a distinctive structure beneath the seat. The single-class model (middle) produces a more generic chair shape, missing these subtle features. The three-class model (right), however, captures part of this structure but also introduces additional surface noise and ambiguity around other regions, especially thin regions such as the legs.

In summary, the model trained solely on the chair class produces more consistent and stable reconstructions, whereas the model trained on multiple classes generates more diverse results—some with finer geometric details, but others with increased noise or reduced class-specific accuracy.