I had to use 1000 points for the evaluation because of GPU out-of-memory errors. The test accuracy was 0.9759.
GT label is chair, predicted label is chair.
GT label is vase, predicted label is vase.
GT label is lamp, predicted label is lamp.
GT label is chair, predicted label is lamp. This was the only misclassification for the chair class, so the model performs well on chairs. When visualizing this point cloud, the chair does not have as clear a 3D structure as the previous chair visualization: the seat is clearly defined in the correctly classified example, but not in this sample.
GT label is vase, predicted label is lamp. This may be due to the smaller number of points I was forced to use because of GPU constraints. In the point cloud, a significant part of the object is missing, which may cause the model to make an incorrect classification.
GT label is lamp, predicted label is vase. The lamp in this example is very long and thin and lacks the defining cone-like features of the previous lamp example, which had a clear overall shape. The root cause of this failure may be that the global features are not distinctive enough for the model to identify it as a lamp.
Due to GPU constraints, I ran with num_points=900. The test accuracy was 0.8937.
The following visualizations show GT first, prediction second.
Example 1 (good)
Accuracy: 0.8932. The segmentation results are accurate.

Example 2 (good)
Accuracy: 0.8920. The segmentation results are accurate.

Example 3 (good)
Accuracy: 0.8938. The segmentation results are accurate.

Example 4 (bad)
Accuracy: 0.8789. The qualitative results show that, in the prediction, the back of the chair (cyan) spreads into the bottom. This may be because this sample contains more part classes than the previous three, making it more complex to predict.

Example 5 (bad)
Accuracy: 0.8821. In this sample, the red class bleeds into the bottom of the chair, which causes the poor performance. This may again be caused by the larger number of part classes, as in the previous example.

I rotated the input point clouds by 15, 30, 60, and 90 degrees counterclockwise about the z-axis inside eval_cls.py and ran the evaluation (a sketch of the rotation is shown below). Quantitative and qualitative results follow.
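A minimal sketch of the z-axis rotation applied to the evaluation inputs (the helper name rotate_z and the exact variable handling inside eval_cls.py are assumptions for illustration):

```python
import numpy as np
import torch

def rotate_z(points, degrees):
    """Rotate a point cloud counterclockwise about the z-axis.
    points: (..., N, 3) tensor; degrees: rotation angle."""
    theta = np.deg2rad(degrees)
    rot = torch.tensor([
        [np.cos(theta), -np.sin(theta), 0.0],
        [np.sin(theta),  np.cos(theta), 0.0],
        [0.0,            0.0,           1.0],
    ], dtype=points.dtype, device=points.device)
    return points @ rot.T  # apply the same rotation to every point

# e.g., evaluate the classifier on inputs rotated by 30 degrees:
# test_data = rotate_z(test_data, 30)
```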
Quantitative Metrics:
original test accuracy: 0.9759
15-degree test accuracy: 0.9192
30-degree test accuracy: 0.6631
60-degree test accuracy: 0.3431
90-degree test accuracy: 0.2339
Qualitative Metrics (chair example):
15-degree visualization. The GT label was chair, and the predicted label was chair.
30-degree visualization. The GT label was chair, and the predicted label was chair.
60-degree visualization. The GT label was chair, and the predicted label was vase.
90-degree visualization. The GT label was chair, and the predicted label was vase.
Qualitative Metrics (lamp example):
15-degree visualization. The GT label was lamp, and the predicted label was lamp.
30-degree visualization. The GT label was lamp, and the predicted label was lamp.
60-degree visualization. The GT label was lamp, and the predicted label was vase.
90-degree visualization. The GT label was lamp, and the predicted label was vase.
Based on these results, it is clear that the initially trained model is not robust to rotation: as the magnitude of the rotation increases, the overall test accuracy decreases. With slight rotations, the model still correctly classifies the first example in the test set (chair), but with 60- and 90-degree rotations it makes incorrect classifications.
From the training data, the model learned orientation-specific features. We can mitigate this performance regression with data augmentation: by randomly rotating the training point clouds, we can train the model to be more robust to rotations (a sketch is shown below). We could also inject rotation-invariant features, such as relative distances and angles between points; if the model learns how points are arranged with respect to each other, it may make more accurate predictions on rotated inputs.
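A minimal sketch of how random z-axis rotation augmentation could be added to the training loop (this is not what the current training code does; rotate_z is the helper sketched earlier, and the loop variable names are hypothetical):

```python
import torch

def augment_with_random_rotation(points):
    """Apply a random z-axis rotation to a batch of point clouds (sketch)."""
    degrees = torch.rand(1).item() * 360.0  # uniform angle in [0, 360)
    return rotate_z(points, degrees)

# Hypothetical training loop:
# for points, labels in train_loader:
#     points = augment_with_random_rotation(points)
#     logits = model(points)
#     ...
```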
The procedure for rotating input point clouds for the segmentation task is the same as described above: we rotate the input point clouds by 15, 30, 60, and 90 degrees counterclockwise about the z-axis inside eval_seg.py. I report quantitative and qualitative metrics below.
Quantitative and Qualitative Metrics (sample 1):
Original test accuracy: 0.8932
Original GT vs Pred:

15-deg test accuracy: 0.8293
15-deg GT vs Pred:

30-deg test accuracy: 0.7245
30-deg GT vs Pred:

60-deg test accuracy: 0.5035
60-deg GT vs Pred:

90-deg test accuracy: 0.3615
90-deg GT vs Pred:

Quantitative and Qualitative Metrics (sample 2):
Original test accuracy: 0.8925
Original GT vs Pred:

15-deg test accuracy: 0.8309
15-deg GT vs Pred:

30-deg test accuracy: 0.7253
30-deg GT vs Pred:

60-deg test accuracy: 0.5053
60-deg GT vs Pred:

90-deg test accuracy: 0.3623
90-deg GT vs Pred:

Based on these results, we can see that as the magnitude of rotation increases, the test accuracy decreases for each sample. Additionally, for the 60-degree and 90-degree examples, the model begins to predict several different incorrect classes along the same part of the chair. This degradation occurs because the model relies on orientation-dependent spatial features, and rotating the point cloud disrupts the geometric relationships learned during training. Since rotation invariance is not inherently built into the network architecture, the model struggles to generalize to unseen orientations. To improve robustness, future work could incorporate random 3D rotation augmentation during training and apply alignment modules such as T-Nets to make the model more robust to rotations.
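For reference, a minimal sketch of a PointNet-style input T-Net that predicts a 3x3 alignment matrix (this follows the idea from the original PointNet paper and is not part of my current model; layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class TNet(nn.Module):
    """Predicts a 3x3 transform applied to the input points (sketch)."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 1024, 1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, 9),
        )

    def forward(self, points):                        # points: (B, N, 3)
        x = self.mlp(points.transpose(1, 2))          # (B, 1024, N)
        x = torch.max(x, dim=2).values                # global max pool -> (B, 1024)
        mat = self.fc(x).view(-1, 3, 3)
        # bias toward the identity so the transform starts as a no-op
        mat = mat + torch.eye(3, device=points.device).unsqueeze(0)
        return points @ mat                           # aligned points: (B, N, 3)
```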
Due to GPU constraints, the maximum number of points I can use for classification is 1000. Therefore, I test the model with 1000, 750, 500, 250, and 50 points, and report quantitative and qualitative metrics below. I run these experiments on two different samples from the test set.
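A minimal sketch of the point subsampling used for these evaluations (random sampling without replacement; the exact indexing inside eval_cls.py may differ):

```python
import torch

def subsample_points(points, num_points):
    """Randomly keep num_points points from each (B, N, 3) cloud (sketch)."""
    idx = torch.randperm(points.shape[1])[:num_points]
    return points[:, idx, :]

# e.g., evaluate the classifier with 250 points per object:
# test_data = subsample_points(test_data, 250)
```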
Quantitative Metrics:
original (1000 points) test accuracy: 0.9759
750 points test accuracy: 0.9706
500 points test accuracy: 0.9653
250 points test accuracy: 0.9664
50 points test accuracy: 0.9517
Qualitative Metrics (chair example):
For all the visualizations below, the GT label is a chair, and the predicted label is a chair.
1000 points:
750 points:
500 points:
250 points:
50 points:
Qualitative Metrics (lamp example):
For all the visualizations below, the GT label is a lamp, and the predicted label is a lamp.
1000 points:
750 points:
500 points:
250 points:
50 points:
Based on the quantitative and qualitative metrics, we can see that the test accuracy does not decrease drastically as we decrease the number of input points; it only decreases slightly. This suggests that the model has learned to capture the global geometric structure of objects effectively, rather than relying heavily on dense point sampling. In other words, PointNet's use of shared MLPs and global max pooling allows it to extract informative global features even when few points are available, so it performs well even with sparse inputs.
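To make this concrete, here is a simplified sketch of the shared-MLP plus global max-pooling structure (the layer sizes are illustrative, not necessarily those of my trained model):

```python
import torch
import torch.nn as nn

class PointNetClsSketch(nn.Module):
    """Simplified PointNet classifier: shared MLP + global max pool (sketch)."""
    def __init__(self, num_classes=3):  # chair, vase, lamp
        super().__init__()
        # "shared MLP": the same weights are applied to every point independently
        self.point_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 1024, 1), nn.ReLU(),
        )
        self.classifier = nn.Linear(1024, num_classes)

    def forward(self, points):                              # (B, N, 3)
        feats = self.point_mlp(points.transpose(1, 2))      # (B, 1024, N)
        # max pooling keeps the strongest response per channel, so the global
        # descriptor changes little when many points are dropped
        global_feat = torch.max(feats, dim=2).values        # (B, 1024)
        return self.classifier(global_feat)                 # class logits
```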
Due to GPU constraints, the maximum number of points I can use for segmentation is 900. Therefore, I test the model with 900, 750, 500, 250, and 50 points, and report quantitative and qualitative metrics below. I run these experiments on two different samples from the test set.
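For segmentation, the per-point labels must be subsampled with the same indices as the points; a minimal sketch, reusing the idea from the classification experiment (variable names are hypothetical):

```python
import torch

def subsample_points_and_labels(points, labels, num_points):
    """Keep the same random subset of points (B, N, 3) and labels (B, N)."""
    idx = torch.randperm(points.shape[1])[:num_points]
    return points[:, idx, :], labels[:, idx]

# e.g., evaluate segmentation with 500 points per object:
# test_data, test_labels = subsample_points_and_labels(test_data, test_labels, 500)
```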
Quantitative and Qualitative Metrics (sample 1):
Original (900 points) test accuracy: 0.8922
Original GT vs Pred:

750 points test accuracy: 0.8900
750 points GT vs Pred:

500 points test accuracy: 0.8809
500 points GT vs Pred:

250 points test accuracy: 0.8605
250 points GT vs Pred:

50 points test accuracy: 0.7873
50 points GT vs Pred:

Quantitative and Qualitative Metrics (sample 2):
Original (900 points) test accuracy: 0.8932
Original GT vs Pred:

750 points test accuracy: 0.8905
750 points GT vs Pred:

500 points test accuracy: 0.8825
500 points GT vs Pred:

250 points test accuracy: 0.8605
250 points GT vs Pred:

50 points test accuracy: 0.7933
50 points GT vs Pred:

Similarly, the segmentation model also performs well when the number of input points is reduced. Since PointNet aggregates features globally through max pooling, it can maintain meaningful per-point feature representations even with fewer input points. From the visualizations and quantitative metrics, we can see that the classification model is more robust to reduced point counts than the segmentation model. Still, the segmentation model is considerably more robust to sparse inputs than to rotated inputs.
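For reference, a simplified sketch of how a PointNet-style segmentation head combines per-point features with the pooled global feature (layer sizes and the number of part classes are illustrative, not necessarily those of my model):

```python
import torch
import torch.nn as nn

class PointNetSegSketch(nn.Module):
    """Simplified PointNet segmentation head: local + global features (sketch)."""
    def __init__(self, num_part_classes=6):  # illustrative part count
        super().__init__()
        self.local_mlp = nn.Sequential(nn.Conv1d(3, 64, 1), nn.ReLU())
        self.global_mlp = nn.Sequential(nn.Conv1d(64, 1024, 1), nn.ReLU())
        # the per-point head sees each local feature concatenated with the global one
        self.seg_head = nn.Sequential(
            nn.Conv1d(64 + 1024, 256, 1), nn.ReLU(),
            nn.Conv1d(256, num_part_classes, 1),
        )

    def forward(self, points):                                       # (B, N, 3)
        local = self.local_mlp(points.transpose(1, 2))               # (B, 64, N)
        feats = self.global_mlp(local)                               # (B, 1024, N)
        global_feat = torch.max(feats, dim=2, keepdim=True).values   # (B, 1024, 1)
        global_feat = global_feat.expand(-1, -1, local.shape[2])     # (B, 1024, N)
        combined = torch.cat([local, global_feat], dim=1)            # (B, 1088, N)
        return self.seg_head(combined)                               # (B, C, N) per-point logits
```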