Q1. Classification Model (40 points)

Due to GPU out-of-memory errors, I had to use 1000 points for the evaluation. The test accuracy was 0.9759.

Correct examples

alt text GT label is chair, predicted label is chair.

alt text GT label is vase, predicted label is vase.

alt text GT label is lamp, predicted label is lamp.

Failure cases

alt text GT label is chair, predicted label is lamp. This was the only misclassification for the chair class, so the model performs well on chairs. Visualizing this point cloud, the chair does not have a clear 3D structure compared to the previous chair visualization: the seat is clearly defined in the correctly classified sample, but not in this one.

alt text GT label is vase, predicted label is lamp. This may be because GPU constraints forced me to use a smaller number of points. In the point cloud, a significant part of the vase is missing, which may cause the model to misclassify it.

alt text GT label is lamp, predicted label is vase. The lamp in this example is very long and thin, whereas the previous lamp visualization had a clear defining shape. The root cause of this failure may be that the global features are not strong enough for the model to correctly identify it as a lamp: the shape is very thin and lacks the defining cone-like features of the previous lamp example.

Q2. Segmentation Model (40 points)

Due to GPU constraints, I ran with num_points==900. The test accuracy was 0.8937.

The following visualizations show GT first, prediction second.

Example 1 (good) Accuracy: 0.8932. The segmentation results are accurate. alt text alt text

Example 2 (good) Accuracy: 0.8920. The segmentation results are accurate. alt text alt text

Example 3 (good) Accuracy: 0.8938. The segmentation results are accurate. alt text alt text

Example 4 (bad) Accuracy: 0.8789. The qualitative results show that, in the prediction, the back of the chair (cyan) gets spread into the bottom. This may be because this sample has more part classes than the previous three, making it more complex to predict.
alt text alt text

Example 5 (bad) Accuracy: 0.8821. In this sample, the red class bleeds into the bottom of the chair, which causes the poorer performance. As in the previous example, this may be because there are many more part classes present.
alt text alt text

Q3. Robustness Analysis (20 points)

Rotate input point clouds

Classification

I rotated the input point clouds by 15, 30, 60, and 90 degrees counterclockwise about the z-axis inside eval_cls.py and ran the evaluation; a sketch of the rotation is shown below. I report quantitative and qualitative metrics afterwards.
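For reference, here is a minimal sketch of how such a rotation could be applied, assuming the test point clouds are batched as (B, N, 3) tensors; the helper name `rotate_z` is illustrative rather than taken from the actual eval_cls.py.

```python
import math
import torch

def rotate_z(points: torch.Tensor, degrees: float) -> torch.Tensor:
    """Rotate a batch of point clouds (B, N, 3) counterclockwise about the z-axis."""
    theta = math.radians(degrees)
    c, s = math.cos(theta), math.sin(theta)
    rot = torch.tensor([[c, -s, 0.0],
                        [s,  c, 0.0],
                        [0.0, 0.0, 1.0]],
                       dtype=points.dtype, device=points.device)
    # Row-vector convention: p' = p @ R^T applies the same rotation to every point.
    return points @ rot.T

# e.g. rotated = rotate_z(test_points, 30) before feeding the batch to the classifier
```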

Quantitative Metrics:
original test accuracy: 0.9759
15-degrees test accuracy: 0.9192
30-degrees test accuracy: 0.6631
60-degrees test accuracy: 0.3431
90-degrees test accuracy: 0.2339

Sample 0

Qualitative Metrics: 15-degree visualization. The GT label was chair, and the predicted label was chair. alt text

30-degree visualization. The GT label was chair, and the predicted label was chair. alt text

60-degree visualization. The GT label was chair, and the predicted label was vase. alt text

90-degree visualization. The GT label was chair, and the predicted label was vase. alt text

Sample 1

Qualitative Metrics: 15-degree visualization. The GT label was lamp, and the predicted label was lamp.
alt text

30-degree visualization. The GT label was lamp, and the predicted label was lamp.
alt text

60-degree visualization. The GT label was lamp, and the predicted label was vase.
alt text

90-degree visualization. The GT label was lamp, and the predicted label was vase.
alt text

Analysis/Interpretation

Based on these results, it is clear that the initially trained model is not robust to rotation: as the magnitude of the rotation increases, the overall test accuracy decreases. With slight rotations, the model is still able to accurately classify the first example in the test set (chair), but with 60- and 90-degree rotations it makes incorrect classifications.

From the training data, the model learned orientation-specific features. We can mitigate the performance regression by applying data augmentation. By randomly adding rotations to the training point cloud data, we can train the model to be more robust to rotations. We can also try to inject rotation-invariant features that tell us relative distances/angles between points. If the model learns that points are aligned in a certain way (with respect to each other), it may make more accurate predictions on rotated inputs.
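As a rough illustration of the augmentation idea, the sketch below applies a random z-rotation to each training batch; it assumes (B, N, 3) point clouds and a standard PyTorch training loop, and the helper name `random_z_rotation` is hypothetical, not part of the assignment code.

```python
import math
import torch

def random_z_rotation(points: torch.Tensor) -> torch.Tensor:
    """Augmentation: rotate a (B, N, 3) batch by a random angle about the z-axis."""
    theta = torch.rand(1).item() * 2 * math.pi
    c, s = math.cos(theta), math.sin(theta)
    rot = torch.tensor([[c, -s, 0.0],
                        [s,  c, 0.0],
                        [0.0, 0.0, 1.0]],
                       dtype=points.dtype, device=points.device)
    return points @ rot.T

# Hypothetical use inside the training loop:
# for points, labels in train_loader:
#     points = random_z_rotation(points)
#     logits = model(points)
```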

Segmentation

The procedure for rotating input point clouds on the segmentation task is the same as described above: we rotate the input point clouds by 15, 30, 60, and 90 degrees counterclockwise about the z-axis inside eval_seg.py. I report quantitative and qualitative metrics below.

Sample 0

Quantitative and Qualitative Metrics: Original test accuracy: 0.8932
Original GT vs Pred:
alt text alt text

15-deg test accuracy: 0.8293
15-deg GT vs Pred:
TODO TODO

30-deg test accuracy: 0.7245
30-deg GT vs Pred:
TODO TODO

60-deg test accuracy: 0.5035
60-deg GT vs Pred:
TODO TODO

90-deg test accuracy: 0.3615
90-deg GT vs Pred:
TODO TODO

Sample 1

Quantitative and Qualitative Metrics: Original test accuracy: 0.8925
Original GT vs Pred:
TODO TODO

15-deg test accuracy: 0.8309
15-deg GT vs Pred:
TODO TODO

30-deg test accuracy: 0.7253
30-deg GT vs Pred:
TODO TODO

60-deg test accuracy: 0.5053
60-deg GT vs Pred:
TODO TODO

90-deg test accuracy: 0.3623
90-deg GT vs Pred:
TODO TODO

Analysis/Interpretation

Based on these results, we can see that as we increase the magnitude of rotation, the test accuracy decreases for each sample. Additionally, for the 60-deg and 90-deg examples, it is clear that our model begins to predict several different incorrect classes along the same part of the chair. This degradation occurs because the model relies on orientation-dependent spatial features, and rotating the point cloud disrupts the geometric relationships learned during training. Since rotation invariance is not inherently built into the network architecture, the model struggles to generalize to unseen orientations. To improve robustness, future work could incorporate random 3D rotation augmentation during training and apply alignment modules such as T-Nets to make the model more robust to rotations.
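For context, below is a minimal sketch of a T-Net-style input transform in the spirit of the original PointNet alignment network; the layer sizes follow the paper, but this is an illustrative sketch rather than the model used in this assignment.

```python
import torch
import torch.nn as nn

class InputTNet(nn.Module):
    """Predicts a 3x3 transform that re-aligns an input point cloud (sketch)."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 9),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3) -> per-point features (B, 1024, N) -> global max pool.
        feat = self.mlp(points.transpose(1, 2)).max(dim=2).values
        # Predict a 3x3 matrix, biased toward the identity for stability.
        transform = self.fc(feat).view(-1, 3, 3) + torch.eye(3, device=points.device)
        return points @ transform  # re-aligned point cloud, still (B, N, 3)
```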

Adjusting number of points

Classification

Due to GPU constraints, the maximum number of points I could use for classification is 1000. Therefore, I test the model on 1000, 750, 500, 250, and 50 points. I report quantitative and qualitative metrics below. I run these experiments on two different samples from the test set.
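A minimal sketch of the subsampling step is shown below, assuming each test batch is a (B, N, 3) tensor; the helper name `subsample_points` is illustrative, not the actual evaluation code.

```python
import torch

def subsample_points(points: torch.Tensor, num_points: int) -> torch.Tensor:
    """Randomly keep num_points points from each cloud in a (B, N, 3) batch."""
    idx = torch.randperm(points.shape[1])[:num_points]
    return points[:, idx, :]

# e.g. subsample_points(test_points, 250); for segmentation, the per-point
# labels would need to be indexed with the same indices to stay aligned.
```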

Quantitative Metrics:
original (1000 points) test accuracy: 0.9759
750 points test accuracy: 0.9706
500 points test accuracy: 0.9653
250 points test accuracy: 0.9664
50 points test accuracy: 0.9517

Sample 0

Qualitative Metrics:
For all the visualizations below, the GT label is a chair, and the predicted label is a chair.

1000 points:
alt text
750 points:
alt text
500 points:
alt text
250 points:
alt text
50 points:
alt text

Sample 1

Qualitative Metrics:
For all the visualizations below, the GT label is a lamp, and the predicted label is a lamp.
1000 points:
alt text
750 points:
alt text
500 points:
alt text
250 points:
alt text
50 points:
alt text

Analysis/Interpretation

Based on the quantitative and qualitative metrics, we can see that the test accuracy decreases only slightly, not drastically, as we decrease the number of input points. This suggests that the model has learned to capture the global geometric structure of objects effectively, rather than relying heavily on dense point sampling. In other words, PointNet’s use of shared MLPs and global max pooling allows it to extract global features even without many points, so it performs well even with sparse inputs.
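As a toy illustration of why the max pool helps here (the shapes and random values below are made up, not taken from the trained model):

```python
import torch

# Per-point features (B, C, N) as they would come out of the shared MLP.
per_point = torch.randn(1, 1024, 1000)
full   = per_point.max(dim=2).values                # global feature from all 1000 points
sparse = per_point[:, :, :250].max(dim=2).values    # global feature from a 250-point subset
# The two global features tend to stay close, because each channel only needs
# one highly activated point to survive the subsampling.
```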

Segmentation

Due to GPU constraints, the maximum number of points I could use for segmentation is 900. Therefore, I test the model on 900, 750, 500, 250, and 50 points. I report quantitative and qualitative metrics below. I run these experiments on two different samples from the test set.

Sample 0

Quantitative and Qualitative Metrics: Original (900 points) test accuracy: 0.8922
Original GT vs Pred:
alt text alt text

750 points test accuracy: 0.8900
750 points GT vs Pred:
alt text alt text

500 points test accuracy: 0.8809
500 points GT vs Pred:
alt text alt text

250 points test accuracy: 0.8605
250 points GT vs Pred:
alt text alt text

50 points test accuracy: 0.7873
50 points GT vs Pred:
alt text alt text

Sample 1

Quantitative and Qualitative Metrics: Original (900 points) test accuracy: 0.8932
Original GT vs Pred:
alt text alt text

750 points test accuracy: 0.8905
750 points GT vs Pred:
alt text alt text

500 points test accuracy: 0.8825
500 points GT vs Pred:
alt text alt text

250 points test accuracy: 0.8605
250 points GT vs Pred:
alt text alt text

50 points test accuracy: 0.7933
50 points GT vs Pred:
alt text alt text

Analysis/Interpretation

Similarly, the segmentation model also performs well when the number of input points is reduced. Since PointNet aggregates features globally through max pooling, it can maintain meaningful per-point feature representations even with fewer samples. From the visualizations and quantitative metrics, we can see that the classification task is more robust to sparse inputs than the segmentation task; however, the segmentation model is still more robust to sparse inputs than to rotated inputs.
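For context, a rough sketch of how per-point features can be formed for segmentation by concatenating local and pooled global features is shown below; the feature sizes follow the original PointNet paper, but the variable names are illustrative.

```python
import torch

B, N = 1, 900
local_feat  = torch.randn(B, 64, N)                      # per-point features from the shared MLP
global_feat = torch.randn(B, 1024, N).max(dim=2).values  # pooled global feature, (B, 1024)
# Broadcast the global feature back to every point and concatenate, so each
# point carries both its local geometry and whole-object context.
per_point = torch.cat(
    [local_feat, global_feat.unsqueeze(2).expand(-1, -1, N)], dim=1
)  # (B, 1088, N)
```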