# Assignment 2: Single View to 3D
Name: Simson D'Souza, Andrew ID: sjdsouza, Email: sjdsouza@andrew.cmu.edu
## 1. Exploring loss functions
### 1.1 Fitting a voxel grid (5 points)
Figure 1: Ground truth Voxel Grid
Figure 2: Target Voxel Grid
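Fitting a voxel grid amounts to treating each cell as a binary occupancy and minimizing binary cross-entropy against the target grid. A minimal numpy sketch of that computation (the assignment code presumably uses a batched PyTorch equivalent such as `BCEWithLogitsLoss`; this just spells out the math):

```python
import numpy as np

def voxel_bce(pred_logits, target_occ):
    # Binary cross-entropy between predicted occupancy logits and the
    # 0/1 target grid, averaged over all voxels.
    p = 1.0 / (1.0 + np.exp(-pred_logits))  # sigmoid -> occupancy probability
    eps = 1e-7
    p = np.clip(p, eps, 1.0 - eps)          # avoid log(0)
    return float(np.mean(-(target_occ * np.log(p)
                           + (1.0 - target_occ) * np.log(1.0 - p))))
```

With uninformative (zero) logits every voxel costs ln 2, and confidently correct logits drive the loss toward zero.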
### 1.2 Fitting a point cloud (5 points)
Figure 3: Ground truth Point Cloud
Figure 4: Target Point Cloud
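The point cloud is fit by minimizing the Chamfer distance between the predicted and target sets. A brute-force numpy sketch (the actual implementation presumably uses a batched, GPU-friendly version, e.g. PyTorch3D's `chamfer_distance`):

```python
import numpy as np

def chamfer(a, b):
    # Symmetric Chamfer distance between point sets a (N,3) and b (M,3):
    # mean squared distance from each point to its nearest neighbour in
    # the other set, summed over both directions.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)  # (N, M) pairwise
    return float(d2.min(1).mean() + d2.min(0).mean())
```

Identical clouds score exactly zero, and the loss grows with the squared nearest-neighbour gaps in either direction.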
### 1.3 Fitting a mesh (5 points)
Figure 5: Ground truth Mesh
Figure 6: Target Mesh
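Fitting a mesh typically combines a Chamfer-style loss on points sampled from the surface with a smoothness regularizer on the vertices. A minimal numpy sketch of a uniform Laplacian smoothing term (the neighbour structure and uniform weighting here are illustrative assumptions, not the exact regularizer used in this assignment):

```python
import numpy as np

def laplacian_smoothing(verts, neighbors):
    # Uniform Laplacian regularizer: penalize each vertex's squared
    # distance from the centroid of its 1-ring neighbours.
    # verts: (V, 3) array; neighbors: {vertex_index: [neighbour indices]}
    loss = 0.0
    for i, nbrs in neighbors.items():
        loss += float(np.sum((verts[i] - verts[list(nbrs)].mean(0)) ** 2))
    return loss / len(neighbors)
```

A vertex that already sits at the centroid of its neighbours contributes nothing, so the term pulls the surface toward locally smooth configurations without fixing its global shape.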
## 2. Reconstructing 3D from single view
### 2.1 Image to voxel grid (20 points)
Example 1:
Figure 7: RGB Image
Figure 8: Ground truth Voxel Grid
Figure 9: Predicted Voxel Grid
Example 2:
Figure 10: RGB Image
Figure 11: Ground truth Voxel Grid
Figure 12: Predicted Voxel Grid
Example 3:
Figure 13: RGB Image
Figure 14: Ground truth Voxel Grid
Figure 15: Predicted Voxel Grid
### 2.2 Image to point cloud (20 points)
Example 1:
Figure 16: RGB Image
Figure 17: Ground truth Point Cloud
Figure 18: Predicted Point Cloud
Example 2:
Figure 19: RGB Image
Figure 20: Ground truth Point Cloud
Figure 21: Predicted Point Cloud
Example 3:
Figure 22: RGB Image
Figure 23: Ground truth Point Cloud
Figure 24: Predicted Point Cloud
### 2.3 Image to mesh (20 points)
Example 1:
Figure 25: RGB Image
Figure 26: Ground truth Mesh
Figure 27: Predicted Mesh
Example 2:
Figure 28: RGB Image
Figure 29: Ground truth Mesh
Figure 30: Predicted Mesh
Example 3:
Figure 31: RGB Image
Figure 32: Ground truth Mesh
Figure 33: Predicted Mesh
### 2.4 Quantitative comparisons (10 points)
- Voxel: Avg F1@0.05: 69.519
- Point: Avg F1@0.05: 79.251
- Mesh: Avg F1@0.05: 74.978
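For reference, F1@0.05 combines precision (the fraction of predicted points within 0.05 of some ground-truth point) and recall (the converse). A brute-force numpy sketch, assuming Euclidean distances and a hard threshold (the grading code presumably uses an equivalent batched implementation):

```python
import numpy as np

def f1_at_threshold(pred, gt, tau=0.05):
    # F1 score (as a percentage) between a predicted and a ground-truth
    # point cloud at distance threshold tau.
    d2 = ((pred[:, None, :] - gt[None, :, :]) ** 2).sum(-1)  # pairwise sq. dist
    d_pred = np.sqrt(d2.min(1))   # each predicted point -> nearest GT point
    d_gt = np.sqrt(d2.min(0))     # each GT point -> nearest predicted point
    precision = (d_pred < tau).mean()
    recall = (d_gt < tau).mean()
    if precision + recall == 0:
        return 0.0
    return 100.0 * 2 * precision * recall / (precision + recall)
```

A perfect prediction scores 100; a cloud with no points near the ground truth scores 0.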
a. Point Cloud: The point cloud method achieves the highest F1-score because it's the most flexible and least constrained representation. The model doesn't have to learn a rigid structure; it just has to predict a set of (x, y, z) coordinates. This is a much simpler task, as it can place points anywhere in space without worrying about connectivity (like a mesh) or a fixed grid (like voxels).
b. Voxel: Voxel grids have a good F1-score, but they are limited by the resolution of the grid. The model has to decide whether each small cube is "occupied" or "not occupied." The fixed resolution means the model can't capture fine details or curved surfaces as precisely as a point cloud can.
c. Mesh: This is the most constrained and complex representation. The model must not only predict the position of each vertex but also ensure that the faces connecting them form a valid surface. This requires learning both the geometry and the topology of the object, which is much harder than predicting an unordered set of points; the mesh's fixed connectivity further limits the model's flexibility.
Figure 34: Voxel F1 Score
Figure 35: Point Cloud F1 Score
Figure 36: Mesh F1 Score
### 2.5 Analyse effects of hyperparameter variations (10 points)
Comparison of quantitative results of the point cloud representation using 500, 1000, and 2500 points:
- 500 Points: Avg F1@0.05: 62.113
- 1000 Points: Avg F1@0.05: 79.251
- 2500 Points: Avg F1@0.05: 81.561
Figure 37: Point Cloud F1 Score - 500 points
Figure 38: Point Cloud F1 Score - 1000 points
Figure 40: Point Cloud F1 Score - 2500 points
Comparison of qualitative results of the point cloud representation using 500, 1000, and 2500 points:
Figure 41: RGB Image
Figure 42: Ground truth Point Cloud - 500 points
Figure 43: Predicted Point Cloud - 500 points
Figure 44: Ground truth Point Cloud - 1000 points
Figure 45: Predicted Point Cloud - 1000 points
Figure 48: Ground truth Point Cloud - 2500 points
Figure 49: Predicted Point Cloud - 2500 points
Inference: Quantitative Analysis
- The F1-score consistently increases as the number of points (n_points) increases. This shows that having more points allows the model to create a more detailed and accurate representation of the ground truth. A higher point count provides a finer-grained surface, which in turn leads to better F1-scores.
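One reason n_points trades off this way is that it directly sets the dimensionality of the decoder's output. A minimal sketch of such a prediction head (the single linear layer and tanh squashing are assumptions about a typical design, not the actual model used here):

```python
import numpy as np

def point_decoder(feat, w, b, n_points):
    # Map an image feature vector (D,) to n_points 3-D coordinates via
    # one linear layer whose output is reshaped to (n_points, 3); tanh
    # keeps predictions inside the normalized [-1, 1] cube.
    out = np.tanh(feat @ w + b)        # (n_points * 3,)
    return out.reshape(n_points, 3)
```

Doubling n_points doubles the number of output weights the model must fit, which is consistent with the larger clouds needing more training to organize.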
Inference: Qualitative Analysis
- 500 Points: The predicted point cloud is very sparse and lacks any clear structural features of the chair. It appears as a diffuse cloud of points, making it difficult to distinguish the seat from the legs.
- 1000 Points: The structural features of the chair become much more apparent. The seat and backrest are clearly defined, and the outline of the legs starts to emerge, though they are still very sparse.
- 2500 Points: Although the 2500-point model has the highest overall F1-score, its output is less structurally coherent than the 1000-point prediction: it appears as a more diffuse, cloud-like blob that lacks a distinct, organized representation of the chair's legs and backrest. This follows from the increased difficulty of predicting more points without a corresponding increase in training time. The model is effectively under-trained for this task; it has not had enough iterations to organize the additional points into a consistent, stable structure. While the 1000-point model has converged to a clear, albeit sparse, representation of the chair, the 2500-point model is still in an early, unorganized phase of learning, showing that a larger output space requires longer training to achieve qualitative consistency.
### 2.6 Interpret your model (15 points)
To go beyond the F1-score, I created a visualization of my model's error distribution: each predicted point is colored by its distance to the nearest point in the ground-truth point cloud (the per-point term of the Chamfer distance).
- Green points indicate areas of high accuracy where predictions are very close to the ground truth.
- Orange points show areas of medium error, representing less accurate but not entirely incorrect predictions.
- Red points highlight areas of high error where predicted points are far from the ground truth.
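This coloring can be sketched as follows; the `near`/`far` thresholds here are illustrative assumptions, not the exact cutoffs used for the figures:

```python
import numpy as np

def error_colors(pred, gt, near=0.02, far=0.06):
    # Colour each predicted point by its distance to the nearest
    # ground-truth point: green = accurate, orange = medium error,
    # red = high error. Returns an (N, 3) RGB array in [0, 1].
    d2 = ((pred[:, None, :] - gt[None, :, :]) ** 2).sum(-1)
    d = np.sqrt(d2.min(1))                   # nearest-GT distance per point
    colors = np.empty((len(pred), 3))
    colors[d < near] = (0.0, 0.8, 0.0)                  # green
    colors[(d >= near) & (d < far)] = (1.0, 0.6, 0.0)   # orange
    colors[d >= far] = (0.9, 0.0, 0.0)                  # red
    return colors
```

The resulting array can be passed directly as per-point colors to whatever point-cloud renderer is in use.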
This method offers a clear, intuitive way to see exactly where the model is succeeding and where it's failing. Below are the results.
Figure 50: RGB Image
Figure 51: Ground truth Point Cloud
Figure 52: Predicted Point Cloud
Figure 53: Predicted Point Cloud Colored by Error Distribution
Figure 54: RGB Image
Figure 55: Ground truth Point Cloud
Figure 56: Predicted Point Cloud
Figure 57: Predicted Point Cloud Colored by Error Distribution
Figure 58: RGB Image
Figure 59: Ground truth Point Cloud
Figure 60: Predicted Point Cloud
Figure 61: Predicted Point Cloud Colored by Error Distribution
Insights from the visualization of error distribution:
- What the model learns well:
- In all three examples, the model learns the main, solid parts of the chair's body very well. The seat and the backrest are predominantly colored green, with some orange, indicating that the predicted points are close to the ground truth. This shows the model is good at capturing the overall, most prominent shape of the object.
- The model is able to get the general height, width, and depth of the chair correct. The predicted point clouds are not wildly misplaced; they occupy a similar volume to the ground truth.
- What the model fails to learn:
- The most significant errors are consistently found in the legs and thin supports of the chairs. These areas are almost exclusively orange and red. This is because these thin structures are a small part of the total volume, making them difficult to learn and accurately predict. The model tends to fill in these gaps or miss them entirely.
- The model struggles to capture the empty spaces beneath the chairs and between the legs. Instead of predicting individual legs, it often predicts a blob of points that fills in the space, as seen in the green-orange-red coloration that extends downward from the seat. This suggests the model has a bias toward predicting solid, compact shapes rather than objects with complex, open-air structures.
More voxel-representation examples showing failure modes consistent with the insights above:
Example 1: Chair legs and armrest not predicted correctly
Figure 62: RGB Image
Figure 63: Ground truth Voxel Grid
Figure 64: Predicted Voxel Grid
Example 2: Couch not predicted correctly
Figure 65: RGB Image
Figure 66: Ground truth Voxel Grid
Figure 67: Predicted Voxel Grid
## 3. Exploring other architectures / datasets (choose at least one; more than one is extra credit)
### 3.3 Extended dataset for training (10 points)
Trained on the extended dataset and obtained the following results:
Point Cloud Representation:
Comparison of quantitative and qualitative results of training on one class vs. training on three classes:
- Training on one class: Avg F1@0.05: 79.251
- Training on three classes: Avg F1@0.05: 77.547

Figure 68: Point Cloud F1 Score
Figure 69: Point Cloud F1 Score on Extended Dataset
Example 1:
Figure 70: RGB Image
Figure 71: Ground truth Point Cloud
Figure 72: Predicted Point Cloud
Figure 73: Predicted Point Cloud on 3 Classes
Example 2:
Figure 74: RGB Image
Figure 75: Ground truth Point Cloud
Figure 76: Predicted Point Cloud
Figure 77: Predicted Point Cloud on 3 Classes
Example 3:
Figure 78: RGB Image
Figure 79: Ground truth Point Cloud
Figure 80: Predicted Point Cloud
Figure 81: Predicted Point Cloud on 3 Classes
a. The F1-score for the point cloud model shows a slight decrease when training on the extended, three-class dataset.
b. This minor drop suggests that adding more classes (car and plane) introduces more variability and complexity into the model's training process. While the model is still performing very well, the added diversity slightly reduces its ability to perfectly reconstruct a chair, as it now has to generalize across a wider range of shapes.
Qualitative Comparison (3D Consistency & Diversity):
The images show a clear difference in the quality of the predicted point clouds, especially in the consistency and diversity of the outputs.
a. Training on One Class: The predicted point clouds are generally more consistent and more accurate. The model has learned a strong, singular "chair" representation. As a result, its predictions for a chair are well-defined, and the points are tightly clustered, especially around the seat and backrest.
b. Training on Three Classes: The predicted point clouds are less consistent and more diverse. The model now has to learn features that are common to all three classes (chair, car, plane). This causes its predictions for a chair to be less precise and more spread out. For example, in the second and third examples, the predicted point clouds are more blob-like and lack the clear definition of the legs seen in the single-class model.