# Point Cloud Processing

## Part 1 - Classification Model
In this part, I implemented a classification model for point clouds that predicts, for each object (with N points), which of the following classes it belongs to: [chair, vase, lamp].
The network is a simplified PointNet-style architecture: it first processes the raw points using shared MLP layers implemented as 1D convolutions (with BatchNorm and ReLU). Then, a global max pooling extracts an order-invariant global feature vector, and this 1024-dimensional vector is passed through a 3-layer fully connected classification head (with BatchNorm, ReLU, and dropout) to predict the final object class.
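Below is a minimal PyTorch sketch of this classifier. The 1024-dim global feature and the 3-layer head follow the description above; the exact layer widths and dropout rate are illustrative assumptions, not necessarily my trained configuration.

```python
import torch
import torch.nn as nn

class PointNetClassifier(nn.Module):
    """Simplified PointNet: shared per-point MLPs -> global max pool -> FC head."""

    def __init__(self, num_classes=3):
        super().__init__()
        # Shared MLPs over points, implemented as 1x1 1D convolutions
        self.features = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU(),
        )
        # 3-layer fully connected classification head on the pooled global feature
        self.head = nn.Sequential(
            nn.Linear(1024, 512), nn.BatchNorm1d(512), nn.ReLU(),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, num_classes),
        )

    def forward(self, points):                     # points: (B, N, 3)
        x = self.features(points.transpose(1, 2))  # per-point features (B, 1024, N)
        x = x.max(dim=2).values                    # order-invariant global feature (B, 1024)
        return self.head(x)                        # class logits (B, num_classes)
```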
The results are as follows:
Test Accuracy: 0.9790
Visualized Point Clouds and Predictions Per Class:
- For each class, we visualize 2 correct predictions and 2 incorrect (noted with which class was predicted instead)
Class 0 (Chair)
| Correct #1 | Correct #2 | Incorrect #1 (Predicted Lamp) | Incorrect #2 (Predicted Lamp) |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
Class 1 (Vase)
| Correct #1 | Correct #2 | Incorrect #1 (Predicted Lamp) | Incorrect #2 (Predicted Lamp) |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
Class 2 (Lamp)
| Correct #1 | Correct #2 | Incorrect #1 (Predicted Vase) | Incorrect #2 (Predicted Vase) |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
The model achieves 97.9% accuracy, indicating that it performs well, especially when objects follow the typical geometry of their class. However, misclassifications happen in interesting edge cases where the shape of one object begins to resemble another class. We see a few examples above which highlight why this may happen:
Class 0 (Chair): The second incorrect example has an unusually tall, thin backrest. This vertical structure closely resembles many floor-lamp shapes (e.g., Lamp Correct #2), which likely led to the confusion.
Class 1 (Vase): The second incorrect vase example has a wide base and a bent upper stem that visually resembles a desk lamp, making the lamp prediction understandable.
Both vases and lamps are visually diverse categories with no single canonical shape: lamps may be chandeliers, desk lamps, floor lamps, etc., and vases range from tall cylinders to wide, bowl-like decorative shapes. Because of this shape variability, their geometries often overlap, leading to symmetric misclassification (vases predicted as lamps and lamps predicted as vases).
The model aggregates local features into a single global feature vector by max pooling, which can cause fine structure (e.g., chandelier arms, thin lamp stems, small vase openings) to be lost. This makes it particularly difficult to correctly classify objects with intricate or delicate geometries, such as the chandelier-like lamp in Incorrect #1 for Class 2.
## Part 2 - Segmentation Model
In this part, I implemented a segmentation model for point clouds that outputs a per-point prediction (as opposed to the single global prediction in the classification model), allowing us to segment the points into 6 classes (representing 6 different parts of the object).
The segmentation network also resembles a PointNet-style architecture:
- First, it extracts per-point local features through 3 layers of 1D convolutions (with BatchNorm and ReLU), producing a 256-dim feature vector for each point
- Then, a global feature is computed by passing these local features through deeper shared MLP layers (still implemented as Conv1d), followed by a max pool across points. This produces a 1024-dim global vector (to capture overall object geometry)
- I concatenate this global feature with every point's local features (giving a 1280-dim vector with both local and global context). These combined features are fed through a final segmentation head (Conv1d, BatchNorm, ReLU) that outputs a per-point class probability over the segmentation labels; see the sketch after this list
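A minimal PyTorch sketch of this segmentation network follows. The 256-dim local, 1024-dim global, and 1280-dim fused feature sizes match the description above, while the intermediate widths of the head are assumptions.

```python
import torch
import torch.nn as nn

class PointNetSegmenter(nn.Module):
    """Per-point segmentation: local features + tiled global feature -> shared head."""

    def __init__(self, num_seg_classes=6):
        super().__init__()
        # Per-point local features: 3 shared-MLP layers -> 256-dim per point
        self.local = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 256, 1), nn.BatchNorm1d(256), nn.ReLU(),
        )
        # Deeper shared MLP whose max-pooled output is the 1024-dim global vector
        self.global_mlp = nn.Sequential(
            nn.Conv1d(256, 512, 1), nn.BatchNorm1d(512), nn.ReLU(),
            nn.Conv1d(512, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU(),
        )
        # Segmentation head on the concatenated (256 local + 1024 global) = 1280-dim features
        self.head = nn.Sequential(
            nn.Conv1d(1280, 512, 1), nn.BatchNorm1d(512), nn.ReLU(),
            nn.Conv1d(512, 256, 1), nn.BatchNorm1d(256), nn.ReLU(),
            nn.Conv1d(256, num_seg_classes, 1),
        )

    def forward(self, points):                                  # points: (B, N, 3)
        local = self.local(points.transpose(1, 2))              # (B, 256, N)
        glob = self.global_mlp(local).max(dim=2).values         # (B, 1024)
        glob = glob.unsqueeze(2).expand(-1, -1, local.size(2))  # tile global to every point
        fused = torch.cat([local, glob], dim=1)                 # (B, 1280, N)
        return self.head(fused).transpose(1, 2)                 # per-point logits (B, N, 6)
```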
The results are as follows:
Test Accuracy: 0.9029
Visualized Ground Truth Point Clouds and Predictions:
- Below, I visualized the 3 point clouds with the highest segmentation accuracy and the 3 with the lowest:
Best Predictions
Accuracy: 0.9942
| Ground Truth | Pred |
|---|---|
| ![]() | ![]() |

Accuracy: 0.9945
| Ground Truth | Pred |
|---|---|
| ![]() | ![]() |

Accuracy: 0.9974
| Ground Truth | Pred |
|---|---|
| ![]() | ![]() |
Worst Predictions
Accuracy: 0.4462
| Ground Truth | Pred |
|---|---|
| ![]() | ![]() |

Accuracy: 0.5124
| Ground Truth | Pred |
|---|---|
| ![]() | ![]() |

Accuracy: 0.5292
| Ground Truth | Pred |
|---|---|
| ![]() | ![]() |
The model achieves 90.29% accuracy, indicating strong overall performance, and the best/worst examples visualized above help reveal patterns in when it succeeds or struggles.
The three best predictions correspond to chair objects that closely follow a canonical chair geometry (two or four legs, a flat seat, and a slab-style back). Since all objects have 10,000 points, these thinner structures yield more densely packed point clouds, making segmentation easier because many neighboring points share the same correct label. In addition, these objects only contain three of the six possible segmentation classes, which simplifies the task for the model.
In contrast, the lowest-accuracy cases involve chairs with more atypical geometries. For example, the second-worst example has a cylindrical-style body, which blurs the boundaries between the seat, the sides, and the back. Across the worst cases, the model tends to apply the same labeling pattern it learned from typical chairs: assigning lower points to the dark-blue class, the seat to red, and the chair back to light blue. While this pattern works well for canonical shapes, it fails when the ground truth involves more segmentation classes (e.g., the third example has an additional pink top region, which the model misses entirely).
## Part 3 - Robustness Analysis

### Experiment 1 - Octant Removal
Motivation: I wanted to understand how removing points from a specific part of the object (rather than uniformly downsampling) would affect the classification and segmentation results of the trained models. This roughly simulates a broken chair with a missing leg, a lamp with a torn lampshade, etc. My prediction was that it would not affect segmentation much (since it relies more heavily on local features and should be less affected by the removed points) but would affect classification more strongly, since classification is more reliant on global features.
Procedure:
- At test time, I wrote a script that divides each point cloud into 8 octants and then removes any points that fall in 2 randomly chosen octants
- Since different objects may distribute their points differently across the 8 octants, this can leave different objects with different numbers of remaining points (which would cause issues with tensor batching)
- Thus, I also pad back to N (=10,000) points by randomly duplicating existing points (for segmentation, I also copy over the labels of these duplicated points); a sketch of this transform is shown after this list
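Here is a minimal sketch of the transform, assuming octants are defined by the sign of each coordinate relative to the cloud's centroid (the exact octant definition in my script may differ):

```python
import torch

def remove_octants_and_pad(points, labels=None, num_remove=2):
    """Drop all points falling in `num_remove` random octants, then pad back to
    the original N by duplicating survivors (copying labels for segmentation)."""
    n = points.shape[0]
    # Octant id in [0, 8): one bit per axis, split at the centroid
    bits = (points > points.mean(dim=0)).long()            # (N, 3) in {0, 1}
    octant = bits[:, 0] * 4 + bits[:, 1] * 2 + bits[:, 2]  # (N,)
    removed = torch.randperm(8)[:num_remove]
    kept_idx = (~torch.isin(octant, removed)).nonzero(as_tuple=True)[0]
    # Pad back to N by sampling (with replacement) from the surviving points
    pad_idx = kept_idx[torch.randint(len(kept_idx), (n - len(kept_idx),))]
    idx = torch.cat([kept_idx, pad_idx])
    return points[idx] if labels is None else (points[idx], labels[idx])
```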
These were the accuracy results and some visualizations:
| | Base Accuracy | After Removing 2 Octants | Accuracy Drop |
|---|---|---|---|
| Classification | 0.9790 | 0.9098 | 0.0692 |
| Segmentation | 0.9029 | 0.8837 | 0.0192 |
3 Correct Classifications
| Correct #1 | Correct #2 | Correct #3 |
|---|---|---|
| ![]() | ![]() | ![]() |
3 Incorrect Classifications
| GT Chair, Pred Lamp | GT Chair, Pred Lamp | GT Lamp, Pred Vase |
|---|---|---|
| ![]() | ![]() | ![]() |
3 Highest Accuracy Segmentations (Ground Truth is 1st Row, Prediction is 2nd Row)
| Success #1 | Success #2 | Success #3 |
|---|---|---|
| ![]() | ![]() | ![]() |
3 Lowest Accuracy Segmentations (Ground Truth is 1st Row, Prediction is 2nd Row)
| Fail #1 | Fail #2 | Fail #3 |
|---|---|---|
| ![]() | ![]() | ![]() |
Interpretation
The segmentation model shows a < 2% accuracy drop after the two-octant removal, indicating strong robustness. This makes sense intuitively: segmentation is a local prediction task, and the network relies more on neighborhood-level geometric structure than on the object's full global shape. Removing 2 octants does not significantly disrupt the local patterns that the model likely uses (e.g., bottom points are legs, middle is seat), so predictions remain quite stable.
The best and worst segmentation results visualized above match the patterns observed in Part 2. The model continues to perform well on thin, canonical chair structures and struggles on the same bulkier, atypical geometries as before. This consistency indicates that the errors are tied to inherent shape ambiguity rather than to the missing octants.
The classification model experiences a more notable ~7% drop in accuracy. This is expected because the classifier depends on global object geometry, and removing ~25% of the points disrupts the global representation. The drop is still moderate, though, showing that the model is not overly brittle or overfit and can still infer the correct class when the remaining geometry is somewhat informative.
The classification failure cases above follow the same trends seen in Part 1: vases and lamps that resemble each other, or chairs with unusual or abstract shapes. Essentially, the mistakes are not new failures introduced by removing octants; the challenging cases remain the same. The fact that many chairs are still classified correctly even with missing legs or partial seats demonstrates that the classifier generalizes reasonably well despite the removal.
### Experiment 2 - Different Number of Input Points
Procedure:
- At test time, we change the number of points of the input point clouds to be in [10, 50, 100, 1000, 10000]
- For each num_points value, run the trained model on the whole test dataset and get the accuracy (along with visualizing some success and failure cases); a sketch of the subsampling step is shown after this list
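The subsampling itself is simple uniform random selection; a minimal sketch follows (the commented sweep is schematic, with `evaluate` standing in as a placeholder for my test-set accuracy function):

```python
import torch

def subsample(points, labels=None, num_points=10000):
    """Uniformly subsample an (N, 3) cloud (and per-point labels, if given)."""
    idx = torch.randperm(points.shape[0])[:num_points]
    return points[idx] if labels is None else (points[idx], labels[idx])

# Schematic sweep over sparsity levels; `evaluate` is a hypothetical helper.
# for n in [10, 50, 100, 1000, 10000]:
#     acc = evaluate(model, test_set, transform=lambda pts: subsample(pts, num_points=n))
```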
We get the following results:
| Num Points | Classification Accuracy | Segmentation Accuracy |
|---|---|---|
| 10 | 0.2602 | 0.7206 |
| 50 | 0.5687 | 0.8018 |
| 100 | 0.8772 | 0.8280 |
| 1000 | 0.9759 | 0.8979 |
| 10000 (Base Model) | 0.9790 | 0.9029 |
Below, I have visualized 3 correct classification predictions and 3 incorrect predictions for each of the point increments:
| Num Points | Fail #1 | Fail #2 | Fail #3 | Success #1 | Success #2 | Success #3 |
|---|---|---|---|---|---|---|
| 10 | ![]()<br>GT Chair, Pred Lamp | ![]()<br>GT Chair, Pred Lamp | ![]()<br>GT Vase, Pred Lamp | ![]() | ![]() | ![]() |
| 50 | ![]()<br>GT Chair, Pred Lamp | ![]()<br>GT Chair, Pred Lamp | ![]()<br>GT Chair, Pred Lamp | ![]() | ![]() | ![]() |
| 100 | ![]()<br>GT Chair, Pred Lamp | ![]()<br>GT Chair, Pred Lamp | ![]()<br>GT Chair, Pred Lamp | ![]() | ![]() | ![]() |
| 1000 | ![]()<br>GT Lamp, Pred Vase | ![]()<br>GT Lamp, Pred Vase | ![]()<br>GT Lamp, Pred Vase | ![]() | ![]() | ![]() |
| 10000 | ![]()<br>GT Chair, Pred Lamp | ![]()<br>GT Vase, Pred Lamp | ![]()<br>GT Lamp, Pred Vase | ![]() | ![]() | ![]() |
Below, I have visualized 2 of the most accurate segmentations and 2 of the worst for each point increment:
| Num Points | Failure #1 (GT, Pred) | Failure #2 (GT, Pred) | Success #1 (GT, Pred) | Success #2 (GT, Pred) |
|---|---|---|---|---|
| 10 | ![]() | ![]() | ![]() | ![]() |
| 50 | ![]() | ![]() | ![]() | ![]() |
| 100 | ![]() | ![]() | ![]() | ![]() |
| 1000 | ![]() | ![]() | ![]() | ![]() |
| 10000 | ![]() | ![]() | ![]() | ![]() |
Interpretation
We see that both the classification and segmentation models remain fairly robust even at 1000 and 100 input points; at 100 points, they reach 87.7% and 82.8% accuracy respectively. This is likely because the points are downsampled uniformly, so even with 100 points we still get decent coverage of the object's geometry, especially when the chairs/vases/lamps have thin frames.
Below 100 points (dropping to 50 and 10), segmentation performance stays fairly robust (remaining above 70%) whereas classification drops to 26% at 10 points (worse than the ~33% of random guessing over 3 classes). This is to be expected since segmentation relies heavily on local geometry, so even with a sparse set of points the model can roughly infer which part of the object each point belongs to from its coordinates/features (i.e., bottom points are legs). In the visualizations above, the chairs the model performs best on (thin, canonical chairs) are similar across all point levels, and so are the chairs it performs worst on (large, abstract chairs).
The classification model, by contrast, makes a global prediction, and with very few points it is hard to build a strong understanding of global structure. Distinguishing chair vs. vase vs. lamp becomes extremely difficult at such sparsity (which we can verify visually: the N=10 row would be very difficult even for a human). Interestingly, the classification model often guesses "class=2, lamp" when the ground truth is a chair. This may be because the model has a strong prior that abstract objects with strange geometries are often lamps, whereas chairs (at high point density) have a more predictable shape.
## Part 4 - Incorporating Locality (Bonus)
In this part, I implemented a classification and segmentation model based on the PointNet++ architecture:
PointNet++ Inspired Classification Architecture Explanation
SetAbstraction Layers --> These use farthest point sampling to pick representative centroids from the input point cloud; then, for each centroid, I gather the k nearest points and their features, forming a local patch of points (also normalizing positions so they are relative to the centroid).
LocalPointNet Module --> This is a lightweight feature extractor (using Conv2d, BatchNorm, and ReLU layers) that acts on each centroid's patch to extract neighborhood-level features. Max pooling is then used to produce a single feature vector per centroid. This module is used within the SetAbstraction layers.
Classification Model Overall --> A two-level hierarchical model using 2 SetAbstraction layers. It first samples the N points down to 1024 centroids and extracts local features. These 1024 centroids are further reduced to 256 centroids (while fusing the previous features), allowing local-to-global hierarchical reasoning. Finally, a lightweight MLP head takes the descriptors from the 2nd SetAbstraction layer and uses them for the classification task.
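A minimal sketch of one SetAbstraction layer with its LocalPointNet is below, assuming a simple iterative farthest point sampling loop and brute-force kNN via `torch.cdist`; the class names and widths here are illustrative rather than my exact implementation.

```python
import torch
import torch.nn as nn

def farthest_point_sample(xyz, m):
    """Iteratively pick m well-spread centroid indices. xyz: (B, N, 3)."""
    B, N, _ = xyz.shape
    idx = torch.zeros(B, m, dtype=torch.long, device=xyz.device)
    dist = torch.full((B, N), float("inf"), device=xyz.device)
    farthest = torch.randint(N, (B,), device=xyz.device)
    for i in range(m):
        idx[:, i] = farthest
        centroid = xyz[torch.arange(B), farthest].unsqueeze(1)       # (B, 1, 3)
        dist = torch.minimum(dist, ((xyz - centroid) ** 2).sum(-1))  # dist to nearest pick
        farthest = dist.argmax(-1)                                   # next pick: farthest point
    return idx

class SetAbstraction(nn.Module):
    """FPS centroids -> kNN grouping (centroid-relative) -> LocalPointNet -> max pool."""

    def __init__(self, num_centroids, k, in_dim, out_dim):
        super().__init__()
        self.m, self.k = num_centroids, k
        # LocalPointNet: shared Conv2d MLP applied to every (centroid, neighbor) pair
        self.local_pointnet = nn.Sequential(
            nn.Conv2d(in_dim + 3, out_dim, 1), nn.BatchNorm2d(out_dim), nn.ReLU(),
            nn.Conv2d(out_dim, out_dim, 1), nn.BatchNorm2d(out_dim), nn.ReLU(),
        )

    def forward(self, xyz, feats=None):  # xyz: (B, N, 3); feats: (B, N, in_dim) or None
        B = xyz.shape[0]
        c_idx = farthest_point_sample(xyz, self.m)                                # (B, m)
        centroids = torch.gather(xyz, 1, c_idx.unsqueeze(-1).expand(-1, -1, 3))   # (B, m, 3)
        nn_idx = torch.cdist(centroids, xyz).topk(self.k, largest=False).indices  # (B, m, k)
        batch = torch.arange(B, device=xyz.device)[:, None, None]
        grouped = xyz[batch, nn_idx] - centroids.unsqueeze(2)  # centroid-relative coords
        if feats is not None:
            grouped = torch.cat([grouped, feats[batch, nn_idx]], dim=-1)
        x = self.local_pointnet(grouped.permute(0, 3, 1, 2))   # (B, out_dim, m, k)
        # Max pool over the k neighbors -> one feature vector per centroid
        return centroids, x.max(dim=3).values.transpose(1, 2)  # (B, m, 3), (B, m, out_dim)
```

Stacking two of these (e.g., N points to 1024 centroids, then 1024 to 256 while passing the first layer's features in as `feats`) gives the hierarchical encoder described above; the k values and channel sizes are assumptions.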
PointNet++ Inspired Segmentation Architecture Explanation
The same SetAbstraction layers (2 layers) as in the classification model are used; these build the hierarchical encoding of local neighborhoods.
FeaturePropagation Layers --> I based these on Section 4.3 of the PointNet++ paper. The idea is to upsample the features of the coarse centroids (SetAbstraction Layer 2) to the fine centroids (SetAbstraction Layer 1) using distance-weighted interpolation of the 3 nearest neighbors, and then do this again from the fine centroids to the original N points. I also use skip concatenation to feed back in the original features from SetAbstraction Layer 1 or the original XYZ coordinates. This feature propagation carries coarse features back to per-point resolution.
Segmentation Model Overall --> The model combines 2 Set Abstraction Layers, 2 Feature Propagation Layers, and then a final segmentation head (per-point MLP with 1D Convolutions) to get per-point predictions that use this local-global feature information
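A sketch of the distance-weighted interpolation inside one FeaturePropagation layer, following the paper's inverse-distance weighting over the 3 nearest neighbors; the class name and layer widths are my own illustrative choices:

```python
import torch
import torch.nn as nn

class FeaturePropagation(nn.Module):
    """Upsample coarse-level features to a finer set of points via inverse-distance
    weighted interpolation of the 3 nearest coarse centroids, with a skip concat."""

    def __init__(self, coarse_dim, skip_dim, out_dim):
        super().__init__()
        self.mlp = nn.Sequential(  # shared per-point MLP applied after fusion
            nn.Conv1d(coarse_dim + skip_dim, out_dim, 1),
            nn.BatchNorm1d(out_dim), nn.ReLU(),
        )

    def forward(self, fine_xyz, coarse_xyz, fine_feats, coarse_feats):
        # fine_xyz: (B, n, 3); coarse_xyz: (B, m, 3)
        # fine_feats: (B, n, skip_dim) skip features (or raw XYZ at the last level)
        # coarse_feats: (B, m, coarse_dim)
        dist3, idx3 = torch.cdist(fine_xyz, coarse_xyz).topk(3, dim=2, largest=False)
        w = 1.0 / (dist3 + 1e-8)
        w = w / w.sum(dim=2, keepdim=True)                 # normalized inverse-distance weights
        batch = torch.arange(fine_xyz.shape[0], device=fine_xyz.device)[:, None, None]
        neighbors = coarse_feats[batch, idx3]              # (B, n, 3, coarse_dim)
        interp = (w.unsqueeze(-1) * neighbors).sum(dim=2)  # (B, n, coarse_dim)
        fused = torch.cat([interp, fine_feats], dim=2)     # skip concatenation
        return self.mlp(fused.transpose(1, 2)).transpose(1, 2)  # (B, n, out_dim)
```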
Both models were trained for 150 epochs.
Accuracy Results
| Model Type | Test Accuracy | # Parameters |
|---|---|---|
| Base Classification Model | 0.9790 | 801,539 |
| PointNet++ Style Classification Model | 0.9664 | 81,795 |
| Model Type | Test Accuracy | # Parameters |
|---|---|---|
| Base Segmentation Model | 0.9029 | 1,525,126 |
| PointNet++ Style Segmentation Model | 0.9103 | 125,446 |
Analysis:
We can see that for the classification task, the PointNet++ style model comes very close to the performance of the base model (from Part 1), and for segmentation, the PointNet++ style model slightly outperforms the base (from Part 2).
Perhaps more interestingly, the PointNet++ style models have far fewer parameters, roughly 8-10% as many as the base models, while remaining competitive in accuracy.
The hierarchical design lets these models extract local and global features efficiently (by sampling, grouping, and propagating) rather than relying on a deep stack of shared-MLP convolution layers, and despite being much smaller, they perform very well.
If I built a larger PointNet++ style model (more SetAbstraction layers, larger feature dimensions), it is quite possible that it would beat the base models (unfortunately, I did not have time to run this experiment for this submission).
We can also visualize cases where the PointNet++ models perform better than the base models and vice versa
### Classification
| Base Model Correct, PointNet++ Wrong | Base Model Correct, PointNet++ Wrong | PointNet++ Correct, Base Model Wrong | PointNet++ Correct, Base Model Wrong |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
| PointNet++ Predicts Lamp | PointNet++ Predicts Chair | Base Model Predicts Lamp | Base Model Predicts Lamp |
- We see the PointNet++ model correcting two of the mistakes made by the base model in Part 1 (the two examples on the right), perhaps due to its multi-scale representation
### Segmentation
| | Base Better — Eg 1 | Base Better — Eg 2 | Base Better — Eg 3 | PointNet++ Better — Eg 1 | PointNet++ Better — Eg 2 | PointNet++ Better — Eg 3 |
|---|---|---|---|---|---|---|
| Ground Truth | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Base Model | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| PointNet++ | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
We see a pattern of the PointNet++ model "over-segmenting": it tends to predict more classes than the base model across all these cases. Sometimes this leads to suboptimal results, as in the left 3 columns, where it assigns points to the yellow class even when they don't belong there. However, in the right 3 columns, we see how this can also be a strength: it helps the model segment smaller portions of the chair, such as the sides or the small backrest (in column 4), better than the base model.
This is likely due to its hierarchical structure, which lets it capture features at different neighborhood sizes; it is probable that one or more centroids were created at the chair's armrests, allowing better segmentation there, whereas the base model does no such multi-level grouping.




























































































