Assignment 2: Single View to 3D

Name: Simson D'Souza, Andrew ID: sjdsouza, Email: sjdsouza@andrew.cmu.edu


1. Exploring loss functions

1.1 Fitting a voxel grid (5 points)

Figure 1: Ground truth Voxel Grid
Figure 2: Optimized Voxel Grid
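For reference, the fit in this part can be sketched as direct optimization of a grid of occupancy logits under a binary cross-entropy loss; this is the standard setup for voxel fitting, but the names below are illustrative, not my exact code.

```python
import torch
import torch.nn.functional as F

# Minimal voxel-fitting sketch: optimize a grid of raw logits so that its
# sigmoid occupancies match a binary target grid (float tensor of 0s/1s).
def fit_voxel(voxels_tgt, n_iter=2000, lr=1e-2):
    voxels_src = torch.randn(voxels_tgt.shape, requires_grad=True)  # logits
    opt = torch.optim.Adam([voxels_src], lr=lr)
    for _ in range(n_iter):
        loss = F.binary_cross_entropy_with_logits(voxels_src, voxels_tgt)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(voxels_src)  # occupancy probabilities in [0, 1]
```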

1.2 Fitting a point cloud (5 points)

Figure 3: Ground truth Point Cloud
Figure 4: Optimized Point Cloud
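The point cloud fit is driven by a Chamfer-style objective, which pulls each point toward its nearest neighbor in the other set. A minimal sketch using torch.cdist for readability (PyTorch3D's knn_points would be the efficient choice in practice):

```python
import torch

# Symmetric Chamfer loss between src (B x N x 3) and tgt (B x M x 3):
# mean squared distance from each point to its nearest neighbor in the other set.
def chamfer_loss(pc_src, pc_tgt):
    dists = torch.cdist(pc_src, pc_tgt)    # B x N x M pairwise distances
    d_src = dists.min(dim=2).values ** 2   # nearest target point per source point
    d_tgt = dists.min(dim=1).values ** 2   # nearest source point per target point
    return d_src.mean() + d_tgt.mean()
```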

1.3 Fitting a mesh (5 points)

Figure 5: Ground truth Mesh
Figure 6: Optimized Mesh
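The mesh fit combines a Chamfer term on points sampled from the deforming mesh surface with a Laplacian smoothness regularizer that keeps each vertex near the average of its neighbors. A hedged sketch using PyTorch3D, reusing chamfer_loss from the point cloud sketch above; the weight w_smooth is illustrative:

```python
from pytorch3d.loss import mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes

# Illustrative mesh-fitting objective: Chamfer on sampled surface points plus
# a Laplacian regularizer that penalizes spiky, self-folding geometry.
def mesh_loss(mesh_src, pc_tgt, w_smooth=0.1):
    pc_src = sample_points_from_meshes(mesh_src, num_samples=5000)  # B x 5000 x 3
    return chamfer_loss(pc_src, pc_tgt) + w_smooth * mesh_laplacian_smoothing(mesh_src)
```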

2. Reconstructing 3D from single view

2.1 Image to voxel grid (20 points)

Example 1:

Figure 7: RGB Image
Figure 8: Ground truth Voxel Grid
Figure 9: Predicted Voxel Grid

Example 2:

Figure 10: RGB Image
Figure 11: Ground truth Voxel Grid
Figure 12: Predicted Voxel Grid

Example 3:

Figure 13: RGB Image
Figure 14: Ground truth Voxel Grid
Figure 15: Predicted Voxel Grid

2.2 Image to point cloud (20 points)

Example 1:

Figure 16: RGB Image
Figure 17: Ground truth Point Cloud
Figure 18: Predicted Point Cloud

Example 2:

Figure 19: RGB Image
Figure 20: Ground truth Point Cloud
Figure 21: Predicted Point Cloud

Example 3:

Figure 22: RGB Image
Figure 23: Ground truth Point Cloud
Figure 24: Predicted Point Cloud

2.3 Image to mesh (20 points)

Example 1:

Figure 25: RGB Image
Figure 26: Ground truth Mesh
Figure 27: Predicted Mesh

Example 2:

Figure 28: RGB Image
Figure 29: Ground truth Mesh
Figure 30: Predicted Mesh

Example 3:

Figure 31: RGB Image
Figure 32: Ground truth Mesh
Figure 33: Predicted Mesh

2.4 Quantitative comparisons (10 points)

Voxel: Avg F1@0.05: 69.519

Point: Avg F1@0.05: 79.251

Mesh: Avg F1@0.05: 74.978

a. Point Cloud: The point cloud method achieves the highest F1-score because it is the most flexible, least constrained representation. The model does not have to learn a rigid structure; it only has to predict a set of (x, y, z) coordinates. This is a simpler learning task: points can be placed anywhere in space without worrying about connectivity (as in a mesh) or a fixed grid (as with voxels).

b. Voxel: Voxel grids have a good F1-score, but they are limited by the resolution of the grid. The model has to decide whether each small cube is "occupied" or "not occupied." The fixed resolution means the model can't capture fine details or curved surfaces as precisely as a point cloud can.

c. Mesh: The mesh is the most constrained and complex representation. The model must not only predict the position of each vertex but also ensure that the faces connecting them form a valid surface. This requires learning both the geometry and the connectivity of the object, which is a harder task than predicting an unordered set of points; the rigid connectivity of the mesh limits the model's flexibility.
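For context, the F1@0.05 metric behind these numbers can be sketched as follows: precision is the fraction of predicted points within 0.05 of some ground-truth point, recall is the fraction of ground-truth points within 0.05 of some predicted point, and F1 is their harmonic mean (reported in percent). A minimal unbatched version with illustrative names:

```python
import torch

# F1 at distance threshold t for point clouds pred (N x 3) and gt (M x 3).
def f1_score(pc_pred, pc_gt, t=0.05):
    dists = torch.cdist(pc_pred, pc_gt)                       # N x M
    precision = (dists.min(dim=1).values < t).float().mean()  # pred points near gt
    recall = (dists.min(dim=0).values < t).float().mean()     # gt points near pred
    return 100.0 * 2 * precision * recall / (precision + recall + 1e-8)
```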

Figure 34: Voxel F1 Score
Figure 35: Point Cloud F1 Score
Figure 36: Mesh F1 Score

2.5 Analyze effects of hyperparameter variations (10 points)

Comparison of quantitative results of the point cloud representation using 500, 1000, and 2500 points:

  1. 500 Points: Avg F1@0.05: 62.113
  2. 1000 Points: Avg F1@0.05: 79.251
  3. 2500 Points: Avg F1@0.05: 81.561

Figure 37: Point Cloud F1 Score - 500 points
Figure 38: Point Cloud F1 Score - 1000 points
Figure 40: Point Cloud F1 Score - 2500 points

Comparison of qualitative results of the point cloud representation using 500, 1000, and 2500 points:

Figure 41: RGB Image
Figure 42: Ground truth Point Cloud - 500 points
Figure 43: Predicted Point Cloud - 500 points
Figure 44: Ground truth Point Cloud - 1000 points
Figure 45: Predicted Point Cloud - 1000 points
Figure 48: Ground truth Point Cloud - 2500 points
Figure 49: Predicted Point Cloud - 2500 points

Inference: Quantitative Analysis

  • The F1-score consistently increases as the number of points (n_points) increases. This shows that having more points allows the model to create a more detailed and accurate representation of the ground truth. A higher point count provides a finer-grained surface, which in turn leads to better F1-scores.

Inference: Qualitative Analysis

  • 500 Points: The predicted point cloud is very sparse and lacks any clear structural features of the chair. It appears as a diffuse cloud of points, making it difficult to distinguish the seat from the legs.
  • 1000 Points: The structural features of the chair become much more apparent. The seat and backrest are clearly defined, and you can start to see the outline of the legs, though they are still very sparse.
  • 2500 Points: Although the 2500-point model has a higher overall F1-score, its output is less structurally coherent than the 1000-point prediction: it appears as a more diffuse, cloud-like blob, lacking a distinct, organized representation of the chair's legs and backrest. This is a direct consequence of predicting many more points without a corresponding increase in training time. The model is effectively under-trained for the larger output space; it has not had enough iterations to learn how to organize the additional points into a consistent, stable structure. While the 1000-point model has converged to a clear, albeit sparse, representation of the chair, the 2500-point model is still in an early, unorganized phase of learning. A larger output space requires more training to reach qualitative consistency, as the decoder sketch below makes concrete.
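To make the "larger output space" point concrete, here is a hypothetical decoder head of the kind commonly used for single-view point prediction; my model's actual architecture may differ, and all layer widths are illustrative. Only the final layer depends on n_points, so going from 1000 to 2500 points multiplies its parameter count by 2.5 while the supervision per point stays the same.

```python
import torch.nn as nn

# Hypothetical point-cloud decoder: a global image feature (e.g., 512-d from a
# ResNet encoder) is mapped by an MLP to n_points 3D coordinates.
class PointDecoder(nn.Module):
    def __init__(self, feat_dim=512, n_points=1000):
        super().__init__()
        self.n_points = n_points
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 2048), nn.ReLU(),
            nn.Linear(2048, n_points * 3), nn.Tanh(),  # coordinates in [-1, 1]
        )

    def forward(self, feat):  # feat: B x feat_dim
        return self.mlp(feat).view(-1, self.n_points, 3)
```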

2.6 Interpret your model (15 points)

To go beyond the F1-score, I created a visualization of the model's error distribution: each predicted point is colored by its distance to the nearest point in the ground truth point cloud (the per-point term of the one-sided Chamfer distance).

  • Green points indicate areas of high accuracy where predictions are very close to the ground truth.
  • Orange points show areas of medium error, representing less accurate but not entirely incorrect predictions.
  • Red points highlight areas of high error where predicted points are far from the ground truth.

This method offers a clear, intuitive way to see exactly where the model succeeds and where it fails. A minimal sketch of the coloring logic follows, and the results appear after it.
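The sketch assumes torch point clouds of shape N x 3; the thresholds (0.05 / 0.10) and RGB values are illustrative, not the exact ones used.

```python
import torch

# Color each predicted point by its distance to the nearest ground-truth point.
def color_by_error(pc_pred, pc_gt, t_low=0.05, t_high=0.10):
    d = torch.cdist(pc_pred, pc_gt).min(dim=1).values  # N: nearest-GT distance
    colors = torch.zeros(len(pc_pred), 3)
    colors[d < t_low] = torch.tensor([0.0, 0.8, 0.0])                    # green: low error
    colors[(d >= t_low) & (d < t_high)] = torch.tensor([1.0, 0.6, 0.0])  # orange: medium error
    colors[d >= t_high] = torch.tensor([0.9, 0.0, 0.0])                  # red: high error
    return colors
```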

Figure 50: RGB Image
Figure 51: Ground truth Point Cloud
Figure 52: Predicted Point Cloud
Figure 53: Predicted Point Cloud Colored by Error Distribution
Figure 54: RGB Image
Figure 55: Ground truth Point Cloud
Figure 56: Predicted Point Cloud
Figure 57: Predicted Point Cloud Colored by Error Distribution
Figure 58: RGB Image
Figure 59: Ground truth Point Cloud
Figure 60: Predicted Point Cloud
Figure 61: Predicted Point Cloud Colored by Error Distribution

Insights from the visualization of error distribution:

  1. What the model learns well:
    • In all three examples, the model learns the main, solid parts of the chair's body very well. The seat and the backrest are predominantly green, with some orange, indicating that the predicted points are close to the ground truth. This shows the model is good at capturing the overall, most prominent shape of the object.
    • The model is able to get the general height, width, and depth of the chair correct. The predicted point clouds are not wildly misplaced; they occupy a similar volume to the ground truth.
  2. What the model fails to learn:
    • The most significant errors are consistently found in the legs and thin supports of the chairs. These areas are almost exclusively orange and red. This is because these thin structures are a small part of the total volume, making them difficult to learn and accurately predict. The model tends to fill in these gaps or miss them entirely.
    • The model struggles to capture the empty spaces beneath the chairs and between the legs. Instead of predicting individual legs, it often predicts a blob of points that fills in the space, as seen in the green-orange-red coloration extending downward from the seat. This suggests the model is biased toward predicting solid, compact shapes rather than objects with complex, open-air structures.

More examples, using the voxel representation, of what the model fails to predict correctly, consistent with the insights above:

Example 1: Chair legs and armrests not predicted correctly

Figure 62: RGB Image
Figure 63: Ground truth Voxel Grid
Figure 64: Predicted Voxel Grid

Example 2: Couch not predicted correctly

Figure 65: RGB Image
Figure 66: Ground truth Voxel Grid
Figure 67: Predicted Voxel Grid

3. Exploring other architectures / datasets. (Choose at least one! More than one is extra credit)

3.3 Extended dataset for training (10 points)

I trained on the extended dataset (chair, car, and plane) and obtained the following results:

Point Cloud Representation:

  • Comparison of quantitative and qualitative results of training on one class vs. training on three classes:

    Training on one class: Avg F1@0.05: 79.251

    Training on three classes: Avg F1@0.05: 77.547

    Figure 68: Point Cloud F1 Score (trained on one class)
    Figure 69: Point Cloud F1 Score on Extended Dataset (trained on three classes)

Example 1:

Figure 70: RGB Image
Figure 71: Ground truth Point Cloud
Figure 72: Predicted Point Cloud (trained on one class)
Figure 73: Predicted Point Cloud (trained on three classes)

Example 2:

Figure 74: RGB Image
Figure 75: Ground truth Point Cloud
Figure 76: Predicted Point Cloud (trained on one class)
Figure 77: Predicted Point Cloud (trained on three classes)

Example 3:

Figure 78: RGB Image
Figure 79: Ground truth Point Cloud
Figure 80: Predicted Point Cloud (trained on one class)
Figure 81: Predicted Point Cloud (trained on three classes)

  • Quantitative Comparison (F1-Score):

    a. The F1-score for the point cloud model shows a slight decrease when training on the extended, three-class dataset.

    b. This minor drop suggests that adding more classes (car and plane) introduces more variability and complexity into training. While the model still performs well, the added diversity slightly reduces its ability to reconstruct a chair precisely, as it now has to generalize across a wider range of shapes.

  • Qualitative Comparison (3D Consistency & Diversity):

    The images show a clear difference in the quality of the predicted point clouds, especially in the consistency and diversity of the outputs.

    a. Training on One Class: The predicted point clouds are generally more consistent and more accurate. The model has learned a strong, singular "chair" representation. As a result, its predictions for a chair are well-defined, and the points are tightly clustered, especially around the seat and backrest.

    b. Training on Three Classes: The predicted point clouds are less consistent and more diverse. The model now has to learn features common to all three classes (chair, car, plane), which makes its chair predictions less precise and more spread out. In the second and third examples, the predicted point clouds are more blob-like and lack the clear definition of the legs seen in the single-class model.