Andrew ID: abhinavm
I used AWS and spun up an EC2 g4dn.xlarge instance with the Deep Learning AMI GPU PyTorch 2.1.0 (Ubuntu 20.04) 20231103. A few setup tips if you're using something similar: since the CUDA version of this image is 12.1, there will be some CUDA/torch/torchvision compatibility issues, so install torch 2.5.1 with the matching torchvision build (both compatible with CUDA 12.1) and, very importantly, use the following command to install pytorch3d:
Commands:
python train.py --task 'cls'
python eval_cls.py
After training my model for 250 epochs, the best model had a test accuracy of 0.981.
Successful predictions:
| Class | Chairs | Vases | Lamps |
|---|---|---|---|
| Point Cloud | ![]() | ![]() | ![]() |
Unsuccessful predictions:
| Class | Vases | Lamps |
|---|---|---|
| Prediction | Lamps | Vases |
| Point Cloud | ![]() | ![]() |
Inference: My model had 100% recall on chairs! The overall accuracy of this model is very high because it is a relatively simple task. Some test samples are, however, very hard even for humans to tell apart, such as the lamp point cloud that resembles a vase. Also, the training data contained a lot of upside-down lamps, which I believe is why the model predicted a lamp for the incorrectly classified vase point cloud: many training samples have a bulb cover like the one shown below:

Commands:
python train.py --task 'seg'
python eval_seg.py
After training my model for 200 epochs, the best model had a test accuracy of 0.903.
Good predictions (accuracy > 0.9):
| Accuracy | 0.935 | 0.991 | 0.992 |
|---|---|---|---|
| Predicted | ![]() | ![]() | ![]() |
| Ground Truth | ![]() | ![]() | ![]() |
Bad predictions (accuracy < 0.60):
| Accuracy | 0.513 | 0.567 | 0.537 |
|---|---|---|---|
| Predicted | ![]() | ![]() | ![]() |
| Ground Truth | ![]() | ![]() | ![]() |
Inference: The model does a decent job predicting most of the chair parts, but struggles with the exact boundaries between sections. This works better for chairs where the parts (seat, back, arms, legs, etc.) are clearly distinct from each other - you can see this in the successful predictions. But for shapes like couches where the boundaries are more ambiguous, the model has a harder time. To be fair, even humans might disagree on where exactly the “seat” ends and the “back” begins on some of these.
I conducted two experiments to analyze the robustness of my classification model: (1) rotation of input point clouds and (2) varying the number of input points.
Procedure: I rotated all test point clouds around the X-axis (tilting forward/backward) by angles ranging from 0° to 135° in 5 evenly spaced steps. The model was trained on unrotated data, so this tests how well it generalizes to unseen orientations.
Commands:
python eval_cls.py --rotate --output_dir output_robustness
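Under the hood this is just a rigid rotation of each cloud before inference. A minimal sketch of the idea (assuming an (N, 3) NumPy point cloud; the actual logic lives behind the `--rotate` flag in eval_cls.py):

```python
import numpy as np

def rotate_x(points, angle_deg):
    """Rotate an (N, 3) point cloud about the X-axis by angle_deg degrees."""
    theta = np.deg2rad(angle_deg)
    rot = np.array([
        [1.0, 0.0, 0.0],
        [0.0, np.cos(theta), -np.sin(theta)],
        [0.0, np.sin(theta), np.cos(theta)],
    ])
    return points @ rot.T

# Angles used in this experiment: 0, 33.75, 67.5, 101.25, 135 degrees.
angles = np.linspace(0.0, 135.0, 5)
```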
Results:
| Rotation Angle (°) | 0 | 33.75 | 67.5 | 101.25 | 135 |
|---|---|---|---|---|---|
| Accuracy | 0.981 | 0.775 | 0.302 | 0.298 | 0.66 |
Visualization with Predictions:
Sample 1: Chair
| Angle | 0 | 33.75 | 67.5 | 101.25 | 135 |
|---|---|---|---|---|---|
| Point Cloud | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | Chair | Lamp | Vase | Chair | Chair |
Sample 2: Chair
| Angle | 0 | 33.75 | 67.5 | 101.25 | 135 |
|---|---|---|---|---|---|
| Point Cloud | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | Chair | Chair | Chair | Vase | Chair |
Sample 3: Vase
| Angle | 0 | 33.75 | 67.5 | 101.25 | 135 |
|---|---|---|---|---|---|
| Point Cloud | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | Vase | Vase | Vase | Vase | Chair |
Sample 4: Lamp
| Angle | 0 | 33.75 | 67.5 | 101.25 | 135 |
|---|---|---|---|---|---|
| Point Cloud | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | Vase | Vase | Vase | Lamp | Lamp |
Sample 5: Chair
| Angle | 0 | 33.75 | 67.5 | 101.25 | 135 |
|---|---|---|---|---|---|
| Point Cloud | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | Chair | Chair | Lamp | Vase | Chair |
Inference: The model’s accuracy degrades significantly as rotation increases, dropping from 98% to around 30% at intermediate angles. PointNet’s global max pooling provides some rotation invariance, but the model has learned orientation-specific features from the training distribution. X-axis rotation is particularly challenging because it disrupts the vertical structure the model relies on - chairs tilted forward start resembling lamps or vases from certain angles.
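For context on why max pooling helps at all, here is a simplified PointNet-style classifier (a sketch of the general architecture, not my exact layer sizes): a shared per-point MLP followed by a global max pool, which makes the global feature permutation invariant over points, but not rotation invariant.

```python
import torch
import torch.nn as nn

class TinyPointNetCls(nn.Module):
    """Simplified PointNet-style classifier: shared per-point MLP + global max pool."""
    def __init__(self, num_classes=3):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 1024), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, points):                 # points: (B, N, 3)
        feats = self.point_mlp(points)         # (B, N, 1024) per-point features
        global_feat = feats.max(dim=1).values  # order-invariant global descriptor
        return self.head(global_feat)          # (B, num_classes) logits
```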
Procedure: I evaluated the model with varying numbers of points per object: 10000 (full), 5000, 500, 100, 50, and 25. This tests how well the model handles sparse point clouds.
Commands:
python eval_cls.py --sample_points --output_dir output_robustness
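The sparse-input evaluation simply subsamples each cloud before it is fed to the model. A minimal sketch (assuming NumPy and sampling without replacement; the real logic sits behind the `--sample_points` flag):

```python
import numpy as np

def subsample(points, num_points, seed=0):
    """Randomly keep num_points points from an (N, 3) cloud, without replacement."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(points.shape[0], size=num_points, replace=False)
    return points[idx]

# Point counts used in this experiment.
counts = [10000, 5000, 500, 100, 50, 25]
```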
Results:
| Num Points | 10000 | 5000 | 500 | 100 | 50 | 25 |
|---|---|---|---|---|---|---|
| Accuracy | 0.9811 | 0.9801 | 0.9685 | 0.9286 | 0.7901 | 0.4858 |
Visualization with Predictions:
Sample 1: Chair
| Points | 10000 | 5000 | 500 | 100 | 50 | 25 |
|---|---|---|---|---|---|---|
| Point Cloud | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | Chair | Chair | Chair | Lamp | Lamp | Lamp |
Sample 2: Chair
| Points | 10000 | 5000 | 500 | 100 | 50 | 25 |
|---|---|---|---|---|---|---|
| Point Cloud | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | Chair | Chair | Chair | Chair | Chair | Lamp |
Sample 3: Vase
| Points | 10000 | 5000 | 500 | 100 | 50 | 25 |
|---|---|---|---|---|---|---|
| Point Cloud | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | Vase | Vase | Vase | Vase | Vase | Lamp |
Sample 4: Lamp
| Points | 10000 | 5000 | 500 | 100 | 50 | 25 |
|---|---|---|---|---|---|---|
| Point Cloud | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | Vase | Vase | Vase | Vase | Vase | Lamp |
Sample 5: Chair
| Points | 10000 | 5000 | 500 | 100 | 50 | 25 |
|---|---|---|---|---|---|---|
| Point Cloud | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | Chair | Chair | Chair | Chair | Chair | Lamp |
Inference: The model shows strong robustness to point sparsity, maintaining 97% accuracy even at 500 points and 93% at 100 points. Performance only degrades significantly below 50 points (79% → 49% at 25 points). This is because PointNet’s global max pooling aggregates the most salient features regardless of point count. Interestingly, at very low counts everything tends to get classified as “lamp” - likely because sparse point clouds lose distinctive structural features and default to the simplest shape.
I also ran the same robustness experiments on my segmentation model.
Commands:
python eval_seg.py --rotate --output_dir output_robustness_seg
Results:
| Rotation Angle (°) | 0 | 33.75 | 67.5 | 101.25 | 135 |
|---|---|---|---|---|---|
| Accuracy | 0.9031 | 0.7072 | 0.4241 | 0.2510 | 0.2197 |
Visualization with Per-Sample Accuracy:
Sample 1 (idx: 49):
| Angle | 0 | 33.75 | 67.5 | 101.25 | 135 |
|---|---|---|---|---|---|
| Ground Truth | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | ![]() | ![]() | ![]() | ![]() | ![]() |
| Accuracy | 0.865 | 0.680 | 0.545 | 0.270 | 0.297 |
Sample 2 (idx: 581):
| Angle | 0 | 33.75 | 67.5 | 101.25 | 135 |
|---|---|---|---|---|---|
| Ground Truth | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | ![]() | ![]() | ![]() | ![]() | ![]() |
| Accuracy | 0.962 | 0.463 | 0.536 | 0.309 | 0.074 |
Sample 3 (idx: 82):
| Angle | 0 | 33.75 | 67.5 | 101.25 | 135 |
|---|---|---|---|---|---|
| Ground Truth | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | ![]() | ![]() | ![]() | ![]() | ![]() |
| Accuracy | 0.870 | 0.782 | 0.546 | 0.253 | 0.155 |
Sample 4 (idx: 304):
| Angle | 0 | 33.75 | 67.5 | 101.25 | 135 |
|---|---|---|---|---|---|
| Ground Truth | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | ![]() | ![]() | ![]() | ![]() | ![]() |
| Accuracy | 0.926 | 0.751 | 0.388 | 0.265 | 0.172 |
Sample 5 (idx: 109):
| Angle | 0 | 33.75 | 67.5 | 101.25 | 135 |
|---|---|---|---|---|---|
| Ground Truth | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | ![]() | ![]() | ![]() | ![]() | ![]() |
| Accuracy | 0.829 | 0.704 | 0.527 | 0.447 | 0.199 |
Inference: The segmentation model is far more sensitive to rotation than the classification model, with accuracy plummeting from 90% to just 22% at 135°. This is expected since segmentation requires precise per-point predictions that depend heavily on learned spatial patterns. The model relies on vertical cues like “legs are at the bottom” and “back is at the top” to assign part labels, and these assumptions completely break down when the chair is tilted.
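To see why per-point labels are so position dependent, here is a simplified sketch of the standard PointNet-style segmentation head (per-point features concatenated with the tiled global feature before per-point classification). This illustrates the idea rather than my exact layer sizes, and the number of part labels (6) is an assumption:

```python
import torch
import torch.nn as nn

class TinyPointNetSeg(nn.Module):
    """Simplified PointNet-style segmentation head: local features + tiled global feature."""
    def __init__(self, num_parts=6):  # 6 part labels assumed here for illustration
        super().__init__()
        self.local_mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU())
        self.global_mlp = nn.Sequential(nn.Linear(64, 1024), nn.ReLU())
        self.seg_head = nn.Sequential(nn.Linear(64 + 1024, 256), nn.ReLU(),
                                      nn.Linear(256, num_parts))

    def forward(self, points):                                  # points: (B, N, 3)
        local = self.local_mlp(points)                          # (B, N, 64) per-point features
        feats = self.global_mlp(local)                          # (B, N, 1024)
        global_feat = feats.max(dim=1, keepdim=True).values     # (B, 1, 1024) shape descriptor
        global_tiled = global_feat.expand(-1, points.shape[1], -1)
        per_point = torch.cat([local, global_tiled], dim=-1)    # (B, N, 1088)
        return self.seg_head(per_point)                         # (B, N, num_parts) logits
```

Because the per-point features are functions of the raw (x, y, z) coordinates, tilting the object changes every per-point input, which is consistent with the sharp drop under rotation.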
Commands:
python eval_seg.py --sample_points --output_dir output_robustness_seg
Results:
| Num Points | 10000 | 5000 | 500 | 100 | 50 |
|---|---|---|---|---|---|
| Accuracy | 0.9031 | 0.9030 | 0.8865 | 0.8195 | 0.7704 |
Visualization with Per-Sample Accuracy:
Sample 1 (idx: 49):
| Points | 10000 | 5000 | 500 | 100 | 50 |
|---|---|---|---|---|---|
| Ground Truth | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | ![]() | ![]() | ![]() | ![]() | ![]() |
| Accuracy | 0.865 | 0.869 | 0.880 | 0.770 | 0.740 |
Sample 2 (idx: 581):
| Points | 10000 | 5000 | 500 | 100 | 50 |
|---|---|---|---|---|---|
| Ground Truth | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | ![]() | ![]() | ![]() | ![]() | ![]() |
| Accuracy | 0.962 | 0.967 | 0.968 | 0.880 | 0.940 |
Sample 3 (idx: 82):
| Points | 10000 | 5000 | 500 | 100 | 50 |
|---|---|---|---|---|---|
| Ground Truth | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | ![]() | ![]() | ![]() | ![]() | ![]() |
| Accuracy | 0.870 | 0.865 | 0.832 | 0.530 | 0.480 |
Sample 4 (idx: 304):
| Points | 10000 | 5000 | 500 | 100 | 50 |
|---|---|---|---|---|---|
| Ground Truth | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | ![]() | ![]() | ![]() | ![]() | ![]() |
| Accuracy | 0.926 | 0.928 | 0.938 | 0.890 | 0.800 |
Sample 5 (idx: 109):
| Points | 10000 | 5000 | 500 | 100 | 50 |
|---|---|---|---|---|---|
| Ground Truth | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | ![]() | ![]() | ![]() | ![]() | ![]() |
| Accuracy | 0.829 | 0.826 | 0.808 | 0.610 | 0.560 |
Inference: Unlike rotation, the segmentation model handles point sparsity gracefully - accuracy only drops from 90% to 77% at 50 points. The model has learned geometric features that remain recognizable even with sparse sampling. Per-sample variance is notable: simple chairs like sample 581 maintain 94% accuracy at 50 points, while complex structures like sample 82 drop to 48%. This suggests the model’s robustness depends on whether the sparse points happen to capture the key part boundaries.
For the classification task:
Commands:
python train.py --task 'cls' --checkpoint_dir 'checkpoints_local_new' --locality --batch_size 32 --num_epochs 200
python eval_cls.py --output_dir 'output_local' --checkpoint_dir 'checkpoints_local_new/cls' --locality
Description:
Accuracy of best model: 0.976 (w/ locality) vs. 0.981 (w/o locality). The performance actually drops slightly when using locality! Note: the model without locality was much bigger than the model with locality.
Visualization:
| Class | Chair | Lamp | Vase |
|---|---|---|---|
| W/O Locality Prediction | Chair | Chair | Lamp |
| W Locality Prediction | Lamp | Chair | Vase |
| Point Cloud | ![]() | ![]() | ![]() |
Inference:
I have included an example where both models are wrong (the lamp example), an example where the model without locality got it right (the chair example), and an example where the model with locality got it right but the model without locality got it wrong (the vase example). For the instances where both models got the classification wrong, it is simply because the samples are too hard, to the extent that even humans would be unable to tell the difference. For the chair example, it is so flat that the model assumes it is not a chair. The model without locality is also much bigger because at each stage it uses all the points and their independent features until the final pooling and MLP, whereas the model with locality hierarchically downsamples the number of points. This means the model without locality has a better chance to overfit, and that appears to be what is happening. I also feel the model with locality generalizes better on fine structures, as in the vase example, where the individual connected components easily convey that it is a vase and not a lamp, unlike the model without locality, which looks at the structure at the bottom and concludes it is a lamp rather than a vase (this is an intuitive explanation; I have not visualized the Grad-CAM outputs or the activations).
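To make the hierarchical downsampling concrete, here is a rough PointNet++-style locality stage (sample centroids, group each centroid's nearest neighbours, run a shared MLP, and max-pool within each neighbourhood). This is a sketch of the general idea, with random centroid sampling standing in for farthest-point sampling, not my exact locality implementation:

```python
import torch
import torch.nn as nn

class LocalGrouping(nn.Module):
    """One locality stage: sample centroids, group k nearest neighbours, pool locally."""
    def __init__(self, out_dim=64, num_centroids=512, k=16):
        super().__init__()
        self.num_centroids, self.k = num_centroids, k
        self.mlp = nn.Sequential(nn.Linear(3, out_dim), nn.ReLU(),
                                 nn.Linear(out_dim, out_dim), nn.ReLU())

    def forward(self, xyz):                                    # xyz: (B, N, 3)
        B, N, _ = xyz.shape
        # Random centroid sampling (farthest-point sampling would be more principled).
        idx = torch.randperm(N, device=xyz.device)[: self.num_centroids]
        centroids = xyz[:, idx, :]                             # (B, M, 3)
        # Indices of the k nearest neighbours of each centroid.
        knn = torch.cdist(centroids, xyz).topk(self.k, largest=False).indices  # (B, M, k)
        groups = torch.gather(
            xyz.unsqueeze(1).expand(B, self.num_centroids, N, 3),
            2, knn.unsqueeze(-1).expand(-1, -1, -1, 3))        # (B, M, k, 3)
        groups = groups - centroids.unsqueeze(2)               # centre each neighbourhood
        feats = self.mlp(groups).max(dim=2).values             # (B, M, out_dim) local features
        return centroids, feats                                 # fewer points, richer features
```

Stacking a couple of stages like this and pooling over the remaining centroids is one way to get the smaller "w/ locality" style of model, compared with pushing every point through the full-width MLP.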
For the segmentation task:
Commands:
python train.py --task 'seg' --checkpoint_dir 'checkpoints_local_new' --locality --batch_size 32 --num_epochs 200
python eval_seg.py --output_dir 'output_local' --checkpoint_dir 'checkpoints_local_new/seg' --locality
Description:
Accuracy of best model: 0.914 (w/ locality) vs. 0.901 (w/o locality). Unlike classification, performance increases with locality, as expected: segmentation is a harder task that needs information about where surfaces change, and locality makes the model better informed about local structure. Note: the model without locality was much bigger than the model with locality.
Visualization:
| Ground Truth | ![]() | ![]() | ![]() | ![]() | ![]() |
|---|---|---|---|---|---|
| W/O Locality Accuracy | 0.510 | 0.622 | 0.561 | 0.625 | 0.991 |
| W/O Locality Prediction | ![]() | ![]() | ![]() | ![]() | ![]() |
| W Locality Accuracy | 0.743 | 0.681 | 0.606 | 0.455 | 0.989 |
| W Locality Prediction | ![]() | ![]() | ![]() | ![]() | ![]() |
Inference:
Keeping in mind that the model with locality is much smaller than the model without it, it achieves better performance in general: most samples get a small boost in accuracy because the model segments points better at transition regions, and there are significant boosts in certain examples, such as example 1 (from the left). There are, however, cases where the model with locality does not perform as well as the model without it, such as the complex example second from the right. The important point is that the improved predictions outweigh the drops in performance elsewhere, so overall this lightweight model performs better than the default model.