
Figure 1: Left: 360-degree view of target voxel grid. Right: 360-degree view of optimized voxel grid.

Figure 2: Left: 360-degree view of target point cloud. Right: 360-degree view of optimized point cloud.

Figure 3: Left: 360-degree view of target mesh. Right: 360-degree view of optimized mesh.



Figure 4: Left: Input image to the model. Center: Voxel Grid Ground-Truth. Right: Voxel Grid Prediction.



Figure 5: Left: Input image to the model. Center: Point Cloud Ground-Truth. Right: Point Cloud Prediction.



Figure 6: Left: Input image to the model. Center: Mesh Ground-Truth. Right: Mesh Prediction.
First, it is important to consider how the performance metrics are computed. Points are sampled from the ground-truth mesh. For voxel-based predictions, a mesh is generated using the marching cubes method, and points are sampled from this mesh. For mesh-based predictions, points are sampled directly from the predicted mesh. For point cloud predictions, no additional preprocessing is needed. The comparison is then made between the ground-truth points and the sampled prediction points.
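A minimal sketch of this preprocessing, assuming PyTorch3D and scikit-image are available; the prediction_to_points helper and the pred_type argument are illustrative names rather than the exact assignment code:

```python
import torch
from skimage import measure
from pytorch3d.structures import Meshes
from pytorch3d.ops import sample_points_from_meshes

def prediction_to_points(pred, pred_type, n_samples=1000):
    """Turn a prediction into a point set for metric computation."""
    if pred_type == "vox":
        # Marching cubes converts the occupancy grid into a surface mesh,
        # from which points are then sampled.
        verts, faces, _, _ = measure.marching_cubes(
            pred.squeeze().cpu().numpy(), level=0.5)
        mesh = Meshes(
            verts=[torch.tensor(verts, dtype=torch.float32)],
            faces=[torch.tensor(faces.copy(), dtype=torch.int64)])
        return sample_points_from_meshes(mesh, n_samples)[0]
    if pred_type == "mesh":
        # Points are sampled directly from the predicted mesh surface.
        return sample_points_from_meshes(pred, n_samples)[0]
    # Point cloud predictions are compared as-is, no preprocessing needed.
    return pred
```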
The voxel-based model exhibits the lowest performance at lower threshold values, primarily due to the limitations of voxelization, which restrict the spatial accuracy of predictions. The mesh-based model shows reduced F1-scores at higher thresholds, likely because it struggles to represent holes, producing surface where there should be none and therefore more false positives and lower precision. In contrast, the point cloud model consistently achieves the highest F1-scores. Although the predicted and ground-truth point distributions may differ (often concentrating more points in certain regions such as legs and edges), the overall performance remains superior.
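For reference, the F1-score at a given threshold is the harmonic mean of precision (the fraction of predicted points lying within the threshold of some ground-truth point) and recall (the converse). A minimal sketch using PyTorch3D's knn_points; the fscore helper name is my own:

```python
import torch
from pytorch3d.ops import knn_points

def fscore(pred_pts, gt_pts, threshold):
    # knn_points returns squared nearest-neighbor distances, hence the sqrt.
    d_pred = knn_points(pred_pts[None], gt_pts[None], K=1).dists.sqrt()
    d_gt = knn_points(gt_pts[None], pred_pts[None], K=1).dists.sqrt()
    precision = (d_pred < threshold).float().mean()  # predicted points near GT
    recall = (d_gt < threshold).float().mean()       # GT points near prediction
    return 2 * precision * recall / (precision + recall + 1e-8)
```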

Figure 7: F1-score of different models.
I chose to vary the n_points hyperparameter and conducted experiments using values of 1000, 2000, and 4000 points.
Qualitative results demonstrate that increasing the n_points value leads to improved reconstructions, particularly in thin structures such as legs. This is expected, as a higher number of points allows for a more detailed and accurate representation, though it also results in a larger model size.
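The larger model size follows from the decoder's output layer, assuming (as is typical for this kind of model) that all coordinates are regressed by a final linear layer: its parameter count grows linearly with n_points. A hypothetical decoder head; layer sizes are illustrative, not the actual model's:

```python
import torch.nn as nn

class PointDecoder(nn.Module):
    """Hypothetical point cloud decoder head."""
    def __init__(self, feat_dim=512, n_points=1000):
        super().__init__()
        self.n_points = n_points
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 2048), nn.ReLU(),
            # The output layer regresses n_points * 3 coordinates, so its
            # parameter count (and thus the model size) scales with n_points.
            nn.Linear(2048, n_points * 3),
        )

    def forward(self, feat):                       # feat: (B, feat_dim)
        return self.mlp(feat).view(-1, self.n_points, 3)
```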



Figure 8: Left: Prediction using 1000 points. Center: Prediction using 2000 points. Right: Prediction using 4000 points.
Quantitatively, higher n_points values also yield better performance. With denser point clouds, the likelihood of finding a nearest neighbor within the evaluation threshold increases, which is consistent both with how the metric is computed and with the improved capture of fine details.
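This trend can be checked directly with the fscore sketch above by comparing two sampling densities against the same stand-in surface; the sphere and threshold here are arbitrary choices for illustration:

```python
from pytorch3d.utils import ico_sphere
from pytorch3d.ops import sample_points_from_meshes

sphere = ico_sphere(level=4)                          # stand-in ground-truth surface
gt = sample_points_from_meshes(sphere, 1000)[0]
sparse = sample_points_from_meshes(sphere, 1000)[0]   # "prediction" with 1000 points
dense = sample_points_from_meshes(sphere, 4000)[0]    # "prediction" with 4000 points
print(fscore(sparse, gt, threshold=0.05))
print(fscore(dense, gt, threshold=0.05))              # typically higher
```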

Figure 9: F1-score of models trained varying the n_points hyperparameter.
The second example provides a clear comparison between different 3D representations and models. The voxel-based model struggles with thin structures and fails to capture fine geometry, but it excels at representing varying topologies, such as holes. In contrast, the mesh-based model accurately reconstructs thin volumes but cannot represent holes, as it is limited by the topology of the initial mesh (a sphere in this case). Point cloud predictions yield good results, especially as the n_points parameter increases, resulting in denser reconstructions. However, converting point clouds into meshes remains challenging, so the usefulness of this representation depends on the intended application.
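The topology limitation is a direct consequence of the usual setup for mesh prediction: the network regresses per-vertex offsets of a fixed template, so the connectivity, and hence the genus, of the initial sphere can never change. A minimal PyTorch3D illustration, assuming this offset-based formulation:

```python
import torch
from pytorch3d.utils import ico_sphere

src_mesh = ico_sphere(level=4)                        # genus-0 sphere template
offsets = torch.zeros_like(src_mesh.verts_packed())   # in practice, predicted by the network
deformed = src_mesh.offset_verts(offsets)             # same connectivity: holes cannot appear
```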

Figure 10: Left: Voxel prediction. Center: Mesh prediction. Right: Point Cloud prediction.
I decided to implement the AtlasNet model, a parametric network that predicts a point cloud from image features and points sampled from a parametric surface.
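A minimal single-patch sketch of the idea, assuming the standard AtlasNet formulation: 2D points sampled from the unit square are concatenated with the image feature and mapped by an MLP to 3D points; class name and layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class AtlasNetDecoder(nn.Module):
    """Single-patch AtlasNet-style decoder."""
    def __init__(self, feat_dim=512, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Tanh(),  # one 3D point per 2D sample
        )

    def forward(self, feat, n_points):
        B = feat.shape[0]
        uv = torch.rand(B, n_points, 2, device=feat.device)  # sample the unit square
        f = feat[:, None, :].expand(-1, n_points, -1)        # broadcast image feature
        return self.mlp(torch.cat([uv, f], dim=-1))          # (B, n_points, 3)
```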



Figure 11: Left: Input image to the model. Center: Point Cloud Ground-Truth. Right: AtlasNet Prediction.
I trained the point cloud model on the full dataset. Due to time constraints, training was limited to 50,000 iterations instead of 100,000.



Figure 12: Left: Input image to the model. Center: Point Cloud Ground-Truth. Right: Point Cloud Prediction.
Qualitative results demonstrate that the model generalizes effectively across different classes, producing outputs that are comparable to those achieved when trained exclusively on the chair dataset.



Figure 13: Left: Input image to the model. Center: Point Cloud Prediction on chair dataset. Right: Point Cloud Prediction on full dataset.
Metrics calculated on the same test set (chair only) show a slight decrease in performance at low threshold values, but an improvement at higher thresholds. This suggests that the model generalizes well to chairs, even when trained on three different classes.

Figure 14: F1-score of models trained using different datasets.