Andrew ID: abhinavm
I used AWS and spun up an EC2 g4dn.xlarge instance with the Deep Learning AMI GPU PyTorch 2.1.0 (Ubuntu 20.04) 20231103. A few setup tips if you're using something similar: since the CUDA version of this image is 12.1, there will be some CUDA/torch/torchvision compatibility issues, so install torch 2.5.1 with the matching torchvision build (both compatible with CUDA 12.1) and, very importantly, use the following command to install pytorch3d:
Commands:
python train.py --task 'cls'
python eval_cls.py
After training my model for 250 epochs, the best model had a test accuracy of 0.981.
Successful predictions:
| Class | Chairs | Vases | Lamps |
|---|---|---|---|
| Point Cloud | ![]() | ![]() | ![]() |
Unsuccessful predictions:
| Class | Vases | Lamps |
|---|---|---|
| Prediction | Lamps | Vases |
| Point Cloud | ![]() | ![]() |
Inference: My model had 100% recall on chairs! The overall accuracy of this model is very high because it is a relatively simple task. Some test samples are, however, very hard even for humans to tell apart, such as the lamp point cloud that resembles a vase. Also, the training data contained a lot of upside-down lamps, which I believe is why the model predicted a lamp for the incorrectly classified vase point cloud: many training samples have a bulb cover like the one shown below:

Commands:
python train.py --task 'seg'
python eval_seg.py
After training my model for 200 epochs, the best model had a test accuracy of 0.903.
Good predictions (accuracy > 0.9):
| Accuracy | 0.935 | 0.991 | 0.992 |
|---|---|---|---|
| Predicted | ![]() | ![]() | ![]() |
| Ground Truth | ![]() | ![]() | ![]() |
Bad predictions (accuracy < 0.60):
| Accuracy | 0.513 | 0.567 | 0.537 |
|---|---|---|---|
| Predicted | ![]() | ![]() | ![]() |
| Ground Truth | ![]() | ![]() | ![]() |
Inference: The model does a decent job predicting most of the chair parts, but struggles with the exact boundaries between sections. This works better for chairs where the parts (seat, back, arms, legs, etc.) are clearly distinct from each other - you can see this in the successful predictions. But for shapes like couches where the boundaries are more ambiguous, the model has a harder time. To be fair, even humans might disagree on where exactly the “seat” ends and the “back” begins on some of these.
I conducted two experiments to analyze the robustness of my classification model: (1) rotation of input point clouds and (2) varying the number of input points.
Procedure: I rotated all test point clouds around the X-axis (tilting forward/backward) by angles ranging from 0° to 135° in 5 evenly spaced steps. The model was trained on unrotated data, so this tests how well it generalizes to unseen orientations.
Commands:
python eval_cls.py --rotate --output_dir output_robustness
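Under the hood this is just a rigid rotation of each cloud before inference. A minimal sketch of the idea (assuming an (N, 3) NumPy point cloud; the actual logic lives behind the `--rotate` flag in eval_cls.py):

```python
import numpy as np

def rotate_x(points, angle_deg):
    """Rotate an (N, 3) point cloud about the X-axis by angle_deg degrees."""
    theta = np.deg2rad(angle_deg)
    rot = np.array([
        [1.0, 0.0, 0.0],
        [0.0, np.cos(theta), -np.sin(theta)],
        [0.0, np.sin(theta), np.cos(theta)],
    ])
    return points @ rot.T

# Angles used in this experiment: 0, 33.75, 67.5, 101.25, 135 degrees.
angles = np.linspace(0.0, 135.0, 5)
```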
Results:
| Rotation Angle (°) | 0 | 33.75 | 67.5 | 101.25 | 135 |
|---|---|---|---|---|---|
| Accuracy | 0.981 | 0.775 | 0.302 | 0.298 | 0.66 |
Visualization with Predictions:
Sample 1: Chair
| Angle | 0 | 33.75 | 67.5 | 101.25 | 135 |
|---|---|---|---|---|---|
| Point Cloud | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | Chair | Lamp | Vase | Chair | Chair |
Sample 2: Chair
| Angle | 0 | 33.75 | 67.5 | 101.25 | 135 |
|---|---|---|---|---|---|
| Point Cloud | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | Chair | Chair | Chair | Vase | Chair |
Sample 3: Vase
| Angle | 0 | 33.75 | 67.5 | 101.25 | 135 |
|---|---|---|---|---|---|
| Point Cloud | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | Vase | Vase | Vase | Vase | Chair |
Sample 4: Lamp
| Angle | 0 | 33.75 | 67.5 | 101.25 | 135 |
|---|---|---|---|---|---|
| Point Cloud | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | Vase | Vase | Vase | Lamp | Lamp |
Sample 5: Chair
| Angle | 0 | 33.75 | 67.5 | 101.25 | 135 |
|---|---|---|---|---|---|
| Point Cloud | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | Chair | Chair | Lamp | Vase | Chair |
Inference: The model’s accuracy degrades significantly as rotation increases, dropping from 98% to around 30% at intermediate angles. PointNet’s global max pooling provides some rotation invariance, but the model has learned orientation-specific features from the training distribution. X-axis rotation is particularly challenging because it disrupts the vertical structure the model relies on - chairs tilted forward start resembling lamps or vases from certain angles.
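For context on why max pooling helps at all, here is a simplified PointNet-style classifier (a sketch of the general architecture, not my exact layer sizes): a shared per-point MLP followed by a global max pool, which makes the global feature permutation invariant over points, but not rotation invariant.

```python
import torch
import torch.nn as nn

class TinyPointNetCls(nn.Module):
    """Simplified PointNet-style classifier: shared per-point MLP + global max pool."""
    def __init__(self, num_classes=3):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 1024), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, points):                 # points: (B, N, 3)
        feats = self.point_mlp(points)         # (B, N, 1024) per-point features
        global_feat = feats.max(dim=1).values  # order-invariant global descriptor
        return self.head(global_feat)          # (B, num_classes) logits
```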
Procedure: I evaluated the model with varying numbers of points per object: 10000 (full), 5000, 500, 100, 50, and 25. This tests how well the model handles sparse point clouds.
Commands:
python eval_cls.py --sample_points --output_dir output_robustness
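The sparse-input evaluation simply subsamples each cloud before it is fed to the model. A minimal sketch (assuming NumPy and sampling without replacement; the real logic sits behind the `--sample_points` flag):

```python
import numpy as np

def subsample(points, num_points, seed=0):
    """Randomly keep num_points points from an (N, 3) cloud, without replacement."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(points.shape[0], size=num_points, replace=False)
    return points[idx]

# Point counts used in this experiment.
counts = [10000, 5000, 500, 100, 50, 25]
```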
Results:
| Num Points | 10000 | 5000 | 500 | 100 | 50 | 25 |
|---|---|---|---|---|---|---|
| Accuracy | 0.9811 | 0.9801 | 0.9685 | 0.9286 | 0.7901 | 0.4858 |
Visualization with Predictions:
Sample 1: Chair
| Points | 10000 | 5000 | 500 | 100 | 50 | 25 |
|---|---|---|---|---|---|---|
| Point Cloud | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | Chair | Chair | Chair | Lamp | Lamp | Lamp |
Sample 2: Chair
| Points | 10000 | 5000 | 500 | 100 | 50 | 25 |
|---|---|---|---|---|---|---|
| Point Cloud | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | Chair | Chair | Chair | Chair | Chair | Lamp |
Sample 3: Vase
| Points | 10000 | 5000 | 500 | 100 | 50 | 25 |
|---|---|---|---|---|---|---|
| Point Cloud | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | Vase | Vase | Vase | Vase | Vase | Lamp |
Sample 4: Lamp
| Points | 10000 | 5000 | 500 | 100 | 50 | 25 |
|---|---|---|---|---|---|---|
| Point Cloud | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | Vase | Vase | Vase | Vase | Vase | Lamp |
Sample 5: Chair
| Points | 10000 | 5000 | 500 | 100 | 50 | 25 |
|---|---|---|---|---|---|---|
| Point Cloud | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | Chair | Chair | Chair | Chair | Chair | Lamp |
Inference: The model shows strong robustness to point sparsity, maintaining 97% accuracy even at 500 points and 93% at 100 points. Performance only degrades significantly below 50 points (79% → 49% at 25 points). This is because PointNet’s global max pooling aggregates the most salient features regardless of point count. Interestingly, at very low counts everything tends to get classified as “lamp” - likely because sparse point clouds lose distinctive structural features and default to the simplest shape.
I also ran the same robustness experiments on my segmentation model.
Commands:
python eval_seg.py --rotate --output_dir output_robustness_seg
Results:
| Rotation Angle (°) | 0 | 33.75 | 67.5 | 101.25 | 135 |
|---|---|---|---|---|---|
| Accuracy | 0.9031 | 0.7072 | 0.4241 | 0.2510 | 0.2197 |
Visualization with Per-Sample Accuracy:
Sample 1 (idx: 49):
| Angle | 0 | 33.75 | 67.5 | 101.25 | 135 |
|---|---|---|---|---|---|
| Ground Truth | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | ![]() | ![]() | ![]() | ![]() | ![]() |
| Accuracy | 0.865 | 0.680 | 0.545 | 0.270 | 0.297 |
Sample 2 (idx: 581):
| Angle | 0 | 33.75 | 67.5 | 101.25 | 135 |
|---|---|---|---|---|---|
| Ground Truth | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | ![]() | ![]() | ![]() | ![]() | ![]() |
| Accuracy | 0.962 | 0.463 | 0.536 | 0.309 | 0.074 |
Sample 3 (idx: 82):
| Angle | 0 | 33.75 | 67.5 | 101.25 | 135 |
|---|---|---|---|---|---|
| Ground Truth | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | ![]() | ![]() | ![]() | ![]() | ![]() |
| Accuracy | 0.870 | 0.782 | 0.546 | 0.253 | 0.155 |
Sample 4 (idx: 304):
| Angle | 0 | 33.75 | 67.5 | 101.25 | 135 |
|---|---|---|---|---|---|
| Ground Truth | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | ![]() | ![]() | ![]() | ![]() | ![]() |
| Accuracy | 0.926 | 0.751 | 0.388 | 0.265 | 0.172 |
Sample 5 (idx: 109):
| Angle | 0 | 33.75 | 67.5 | 101.25 | 135 |
|---|---|---|---|---|---|
| Ground Truth | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | ![]() | ![]() | ![]() | ![]() | ![]() |
| Accuracy | 0.829 | 0.704 | 0.527 | 0.447 | 0.199 |
Inference: The segmentation model is far more sensitive to rotation than the classification model, with accuracy plummeting from 90% to just 22% at 135°. This is expected since segmentation requires precise per-point predictions that depend heavily on learned spatial patterns. The model relies on vertical cues like “legs are at the bottom” and “back is at the top” to assign part labels, and these assumptions completely break down when the chair is tilted.
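To see why per-point labels are so position dependent, here is a simplified sketch of the standard PointNet-style segmentation head (per-point features concatenated with the tiled global feature before per-point classification). This illustrates the idea rather than my exact layer sizes, and the number of part labels (6) is an assumption:

```python
import torch
import torch.nn as nn

class TinyPointNetSeg(nn.Module):
    """Simplified PointNet-style segmentation head: local features + tiled global feature."""
    def __init__(self, num_parts=6):  # 6 part labels assumed here for illustration
        super().__init__()
        self.local_mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU())
        self.global_mlp = nn.Sequential(nn.Linear(64, 1024), nn.ReLU())
        self.seg_head = nn.Sequential(nn.Linear(64 + 1024, 256), nn.ReLU(),
                                      nn.Linear(256, num_parts))

    def forward(self, points):                                  # points: (B, N, 3)
        local = self.local_mlp(points)                          # (B, N, 64) per-point features
        feats = self.global_mlp(local)                          # (B, N, 1024)
        global_feat = feats.max(dim=1, keepdim=True).values     # (B, 1, 1024) shape descriptor
        global_tiled = global_feat.expand(-1, points.shape[1], -1)
        per_point = torch.cat([local, global_tiled], dim=-1)    # (B, N, 1088)
        return self.seg_head(per_point)                         # (B, N, num_parts) logits
```

Because the per-point features are functions of the raw (x, y, z) coordinates, tilting the object changes every per-point input, which is consistent with the sharp drop under rotation.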
Commands:
python eval_seg.py --sample_points --output_dir output_robustness_seg
Results:
| Num Points | 10000 | 5000 | 500 | 100 | 50 |
|---|---|---|---|---|---|
| Accuracy | 0.9031 | 0.9030 | 0.8865 | 0.8195 | 0.7704 |
Visualization with Per-Sample Accuracy:
Sample 1 (idx: 49):
| Points | 10000 | 5000 | 500 | 100 | 50 |
|---|---|---|---|---|---|
| Ground Truth | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | ![]() | ![]() | ![]() | ![]() | ![]() |
| Accuracy | 0.865 | 0.869 | 0.880 | 0.770 | 0.740 |
Sample 2 (idx: 581):
| Points | 10000 | 5000 | 500 | 100 | 50 |
|---|---|---|---|---|---|
| Ground Truth | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | ![]() | ![]() | ![]() | ![]() | ![]() |
| Accuracy | 0.962 | 0.967 | 0.968 | 0.880 | 0.940 |
Sample 3 (idx: 82):
| Points | 10000 | 5000 | 500 | 100 | 50 |
|---|---|---|---|---|---|
| Ground Truth | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | ![]() | ![]() | ![]() | ![]() | ![]() |
| Accuracy | 0.870 | 0.865 | 0.832 | 0.530 | 0.480 |
Sample 4 (idx: 304):
| Points | 10000 | 5000 | 500 | 100 | 50 |
|---|---|---|---|---|---|
| Ground Truth | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | ![]() | ![]() | ![]() | ![]() | ![]() |
| Accuracy | 0.926 | 0.928 | 0.938 | 0.890 | 0.800 |
Sample 5 (idx: 109):
| Points | 10000 | 5000 | 500 | 100 | 50 |
|---|---|---|---|---|---|
| Ground Truth | ![]() | ![]() | ![]() | ![]() | ![]() |
| Prediction | ![]() | ![]() | ![]() | ![]() | ![]() |
| Accuracy | 0.829 | 0.826 | 0.808 | 0.610 | 0.560 |
Inference: Unlike rotation, the segmentation model handles point sparsity gracefully - accuracy only drops from 90% to 77% at 50 points. The model has learned geometric features that remain recognizable even with sparse sampling. Per-sample variance is notable: simple chairs like sample 581 maintain 94% accuracy at 50 points, while complex structures like sample 82 drop to 48%. This suggests the model’s robustness depends on whether the sparse points happen to capture the key part boundaries.
For the classification task:
Commands:
python train.py --task 'cls' --checkpoint_dir 'checkpoints_local_new' --locality --batch_size 32 --num_epochs 200
python eval_cls.py --output_dir 'output_local' --checkpoint_dir 'checkpoints_local_new/cls' --locality
Description:
Accuracy of best model: 0.976 (w/ locality) vs. 0.981 (w/o locality). The performance actually drops slightly when using locality! Note: the model without locality was much bigger than the model with locality.
Visualization:
| Class | Chair | Lamp | Vase |
|---|---|---|---|
| W/O Locality Prediction | Chair | Chair | Lamp |
| W Locality Prediction | Lamp | Chair | Vase |
| Point Cloud | ![]() | ![]() | ![]() |
Inference:
I have included an example where both models are wrong (the lamp example), an example where the model without locality got it right (the chair example), and an example where the model with locality got it right but the model without locality got it wrong (the vase example). For the instances where both models got the classification wrong, it is simply because the samples are too hard, to the extent that even humans would be unable to tell the difference. For the chair example, it is so flat that the model assumes it is not a chair. The model without locality is also much bigger because at each stage it uses all the points and their independent features until the final pooling and MLP, whereas the model with locality hierarchically downsamples the number of points. This means the model without locality has a better chance to overfit, and that appears to be what is happening. I also feel the model with locality generalizes better on fine structures, as in the vase example, where the individual connected components easily convey that it is a vase and not a lamp, unlike the model without locality, which looks at the structure at the bottom and concludes it is a lamp rather than a vase (this is an intuitive explanation; I have not visualized the Grad-CAM outputs or the activations).
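To make the hierarchical downsampling concrete, here is a rough PointNet++-style locality stage (sample centroids, group each centroid's nearest neighbours, run a shared MLP, and max-pool within each neighbourhood). This is a sketch of the general idea, with random centroid sampling standing in for farthest-point sampling, not my exact locality implementation:

```python
import torch
import torch.nn as nn

class LocalGrouping(nn.Module):
    """One locality stage: sample centroids, group k nearest neighbours, pool locally."""
    def __init__(self, out_dim=64, num_centroids=512, k=16):
        super().__init__()
        self.num_centroids, self.k = num_centroids, k
        self.mlp = nn.Sequential(nn.Linear(3, out_dim), nn.ReLU(),
                                 nn.Linear(out_dim, out_dim), nn.ReLU())

    def forward(self, xyz):                                    # xyz: (B, N, 3)
        B, N, _ = xyz.shape
        # Random centroid sampling (farthest-point sampling would be more principled).
        idx = torch.randperm(N, device=xyz.device)[: self.num_centroids]
        centroids = xyz[:, idx, :]                             # (B, M, 3)
        # Indices of the k nearest neighbours of each centroid.
        knn = torch.cdist(centroids, xyz).topk(self.k, largest=False).indices  # (B, M, k)
        groups = torch.gather(
            xyz.unsqueeze(1).expand(B, self.num_centroids, N, 3),
            2, knn.unsqueeze(-1).expand(-1, -1, -1, 3))        # (B, M, k, 3)
        groups = groups - centroids.unsqueeze(2)               # centre each neighbourhood
        feats = self.mlp(groups).max(dim=2).values             # (B, M, out_dim) local features
        return centroids, feats                                 # fewer points, richer features
```

Stacking a couple of stages like this and pooling over the remaining centroids is one way to get the smaller "w/ locality" style of model, compared with pushing every point through the full-width MLP.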
For the segmentation task:
Commands:
python train.py --task 'seg' --checkpoint_dir 'checkpoints_local_new' --locality --batch_size 32 --num_epochs 200
python eval_seg.py --output_dir 'output_local' --checkpoint_dir 'checkpoints_local_new/seg' --locality
Description:
Accuracy of best model: 0.914 (w/ locality) vs. 0.901 (w/o locality). Unlike classification, performance increases with locality, as expected: segmentation is a harder task that needs information about where surfaces change, and locality makes the model better informed about local structure. Note: the model without locality was much bigger than the model with locality.
Visualization:
| Ground Truth | ![]() | ![]() | ![]() | ![]() | ![]() |
|---|---|---|---|---|---|
| W/O Locality Accuracy | 0.510 | 0.622 | 0.561 | 0.625 | 0.991 |
| W/O Locality Prediction | ![]() | ![]() | ![]() | ![]() | ![]() |
| W Locality Accuracy | 0.743 | 0.681 | 0.606 | 0.455 | 0.989 |
| W Locality Prediction | ![]() | ![]() | ![]() | ![]() | ![]() |
Inference:
Keeping in mind that the model with locality is much smaller than the model without it, it achieves better performance in general: most samples get a small boost in accuracy because the model segments points better at transition regions, and there are significant boosts in certain examples, such as example 1 (from the left). There are, however, cases where the model with locality does not perform as well as the model without it, such as the complex example second from the right. The important point is that the improved predictions outweigh the drops in performance elsewhere, so overall this lightweight model performs better than the default model.