Q1. Classification Model (40 points)
Implemented a PointNet-based classification model to classify point clouds into three categories: chairs, vases, and lamps.
Model Architecture
The classification model follows the PointNet architecture:
- Shared MLP: Three 1D convolutional layers (3→64→128→1024) with BatchNorm and ReLU
- Global Feature: Max pooling over all points to extract global feature vector
- Classifier: Fully connected layers (1024→512→256→3) with BatchNorm, ReLU, and Dropout
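The layer sizes above can be sketched in PyTorch as follows. This is a minimal version: the input/feature transform T-Nets from the original PointNet paper are omitted, and the dropout rate is an assumed value.

```python
import torch
import torch.nn as nn

class PointNetCls(nn.Module):
    """Minimal PointNet classifier: shared MLP -> global max pool -> FC head."""

    def __init__(self, num_classes=3):
        super().__init__()
        # Shared per-point MLP as 1x1 convolutions: (B, 3, N) -> (B, 1024, N)
        self.feat = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU(),
        )
        # Classifier head on the 1024-dim global feature (dropout rate assumed)
        self.head = nn.Sequential(
            nn.Linear(1024, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):                 # x: (B, N, 3) point clouds
        x = self.feat(x.transpose(1, 2))  # (B, 1024, N)
        x = x.max(dim=2).values           # symmetric max pool -> (B, 1024)
        return self.head(x)               # (B, num_classes) logits
```

The max pool is the key design choice: it makes the global feature invariant to the ordering of the input points.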
Test Accuracy
Final Test Accuracy: 98.22% (0.9822)
The model achieved excellent performance on the test set, correctly classifying 98.22% of the point cloud objects.
Q2. Segmentation Model (40 points)
Implemented a PointNet-based segmentation model to perform per-point semantic segmentation on chair point clouds with 6 semantic classes.
Model Architecture
The segmentation model uses an encoder-decoder architecture:
- Encoder: Three 1D convolutional layers (3→64→128→1024) to extract point features
- Global Feature: Max pooling to get global context (1024-dim vector)
- Feature Concatenation: Concatenate local features (64-dim) with global feature (1024-dim) → 1088-dim
- Decoder: MLP layers (1088→512→256→128→6) to predict per-point class labels
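A minimal PyTorch sketch of this encoder-decoder (layer sizes follow the description above; any T-Nets are omitted):

```python
import torch
import torch.nn as nn

class PointNetSeg(nn.Module):
    """Minimal PointNet segmentation model: each point's 64-dim local feature
    is concatenated with the repeated 1024-dim global feature (1088-dim)."""

    def __init__(self, num_classes=6):
        super().__init__()
        self.local = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
        )
        self.encoder = nn.Sequential(
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Conv1d(1088, 512, 1), nn.BatchNorm1d(512), nn.ReLU(),
            nn.Conv1d(512, 256, 1), nn.BatchNorm1d(256), nn.ReLU(),
            nn.Conv1d(256, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, num_classes, 1),
        )

    def forward(self, x):                      # x: (B, N, 3)
        n = x.shape[1]
        local = self.local(x.transpose(1, 2))  # (B, 64, N) per-point features
        glob = self.encoder(local).max(dim=2, keepdim=True).values  # (B, 1024, 1)
        fused = torch.cat([local, glob.expand(-1, -1, n)], dim=1)   # (B, 1088, N)
        return self.decoder(fused).transpose(1, 2)  # (B, N, num_classes) logits
```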
Test Accuracy
Final Test Accuracy: 90.45% (0.9045)
The model correctly segments 90.45% of all points across all test objects.
Segmentation Results
Visualized segmentation results for six objects (Objects 0–5), including the failure case on Object 4:
(Image pairs: ground-truth and predicted segmentations for Objects 0–5; Object 4 is the failure case on the armrests.)
Interpretation
The PointNet segmentation model achieves good overall performance (90.45% accuracy) by combining local point features with global context. Key observations:
- Good Cases (Objects 0, 1, 2, 3, 5): The model correctly segments most parts of the chairs, with clear boundaries between different semantic regions (seat, backrest, legs, etc.). The global feature provides useful context for disambiguating similar local geometries.
- Failure Case - Object 4: This chair exhibits a notable segmentation error: the model mislabels parts of the flat chair seat as chair arms (armrests). Two main factors contribute:
  - Dataset Bias: The training data likely contains a majority of armchairs (as seen in Objects 0–3), biasing the model toward predicting arm structures on the sides of chairs.
  - Weak Local Feature Extraction: PointNet's reliance on global max pooling limits its ability to capture fine-grained local geometric differences between chair seats and armrests. The subtle geometric distinction between these regions is not well captured by the model's point-wise features, so semantically different but geometrically similar regions are confused.
- Limitations: PointNet's lack of explicit local neighborhood modeling means it may miss fine-grained details at region boundaries. The max pooling operation aggregates information globally but may not preserve important local spatial relationships needed for precise segmentation, particularly when distinguishing between parts with subtle geometric differences.
Q3. Robustness Analysis (20 points)
Conducted two experiments to analyze the robustness of the learned models: rotation robustness and sensitivity to the number of input points.
Experiment 1: Rotation Robustness
Tested model performance when input point clouds are randomly rotated around all three axes (X, Y, Z) at different angles: 0°, 30°, 60°, 90°, and 180°.
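A sketch of this perturbation in NumPy (the exact convention is an assumption here: the experiment may draw a random angle per axis up to the bound rather than use a fixed angle, and the Z·Y·X composition order is ours):

```python
import numpy as np

def rotate_xyz(points, angle_deg):
    """Rotate an (N, 3) point cloud by angle_deg about X, then Y, then Z."""
    a = np.deg2rad(angle_deg)
    c, s = np.cos(a), np.sin(a)
    Rx = np.array([[1, 0, 0], [0, c, -s], [0, s, c]])
    Ry = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    Rz = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
    # Compose the three axis rotations and apply to every point
    return points @ (Rz @ Ry @ Rx).T
```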
Classification Results
| Rotation Angle | Accuracy | Accuracy Drop |
|---|---|---|
| 0° (baseline) | 98.22% | 0.00% |
| 30° | 78.28% | -19.94% |
| 60° | 33.58% | -64.64% |
| 90° | 29.07% | -69.15% |
| 180° | 30.95% | -67.26% |
Segmentation Results
| Rotation Angle | Accuracy | Accuracy Drop |
|---|---|---|
| 0° (baseline) | 90.45% | 0.00% |
| 30° | 75.10% | -15.35% |
| 60° | 55.46% | -34.99% |
| 90° | 36.52% | -53.92% |
| 180° | 29.23% | -61.22% |
Interpretation
Key Findings:
- Severe Rotation Sensitivity: Both models degrade sharply under rotation. Even a 30° rotation costs roughly 20 percentage points of classification accuracy and 15 points of segmentation accuracy.
- Root Cause: PointNet is not rotation-invariant. The model learns features in the original coordinate system, and rotations change the absolute positions of points, breaking the learned feature representations. The max pooling operation is permutation-invariant but not rotation-invariant.
- Implications: This demonstrates a major limitation of vanilla PointNet: handling rotated inputs effectively requires either rotation augmentation during training or rotation-invariant features.
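The contrast between permutation invariance and rotation sensitivity can be seen even in a toy NumPy example that max-pools raw coordinates in place of learned features:

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.random((100, 3))
pooled = pts.max(axis=0)  # stand-in "global feature": coordinate-wise max

# Permuting the points leaves the max-pooled vector unchanged...
perm = pts[rng.permutation(len(pts))]
assert np.allclose(perm.max(axis=0), pooled)

# ...but a 30 degree rotation about Z changes it.
t = np.deg2rad(30)
Rz = np.array([[np.cos(t), -np.sin(t), 0],
               [np.sin(t),  np.cos(t), 0],
               [0, 0, 1]])
assert not np.allclose((pts @ Rz.T).max(axis=0), pooled)
```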
Segmentation Visualization: Effect of Rotation
Below are visualizations showing how rotation affects segmentation quality. Each pair shows ground truth (top) and prediction (bottom) at different rotation angles:
(Image pairs: ground truth and prediction for the same object at each rotation angle. Prediction accuracy: 90.45% at 0°, 75.10% at 30°, 55.46% at 60°, 36.52% at 90°, 29.23% at 180°.)
Experiment 2: Number of Points
Tested model performance with different numbers of points per object: 10000, 5000, 2000, 1000, and 500. Points were sampled using nested subsets to ensure fair comparison.
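The nested-subset protocol can be sketched as follows (a minimal version; `nested_subsets` and its seed handling are our own naming, not the experiment's exact code). Because each smaller subset is a prefix of the same random permutation, every 500-point input is contained in the corresponding 1000-point input, and so on, which keeps the comparison fair:

```python
import numpy as np

def nested_subsets(points, sizes=(10000, 5000, 2000, 1000, 500), seed=0):
    """Return nested subsets of an (N, 3) cloud: each smaller subset is a
    prefix of one shared random permutation, so subsets are strictly nested."""
    perm = np.random.default_rng(seed).permutation(len(points))
    return {n: points[perm[:n]] for n in sizes if n <= len(points)}
```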
Classification Results
| Number of Points | Accuracy | Accuracy Drop |
|---|---|---|
| 10000 (baseline) | 98.22% | 0.00% |
| 5000 | 98.11% | -0.10% |
| 2000 | 97.48% | -0.73% |
| 1000 | 97.38% | -0.84% |
| 500 | 96.96% | -1.26% |
Segmentation Results
| Number of Points | Accuracy | Accuracy Drop |
|---|---|---|
| 10000 (baseline) | 90.45% | 0.00% |
| 5000 | 90.39% | -0.05% |
| 2000 | 90.22% | -0.23% |
| 1000 | 89.74% | -0.71% |
| 500 | 88.55% | -1.89% |
Interpretation
Key Findings:
- Robust to Point Reduction: Both models show remarkable robustness to reducing the number of input points. Even with only 500 points (5% of original), classification maintains 96.96% accuracy and segmentation maintains 88.55% accuracy.
- Why It Works: PointNet's max pooling operation is particularly effective here - it extracts the most salient features regardless of how many points contribute. As long as the key discriminative points are present, the model can make accurate predictions.
- Practical Implications: This robustness is valuable for real-world applications where point cloud density may vary, or where computational efficiency requires downsampling.
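A toy NumPy check of the max-pooling intuition, using raw coordinates in place of learned features: the pooled vector of a random 5% subset stays close to that of the full cloud, because the pooled value depends only on a handful of extreme points.

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.random((10000, 3))                # stand-in for a point cloud
sub = pts[rng.permutation(len(pts))[:500]]  # keep a random 5% subset

# Coordinate-wise max pooling barely moves when 95% of points are dropped.
assert np.max(np.abs(pts.max(axis=0) - sub.max(axis=0))) < 0.05
```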
Segmentation Visualization: Effect of Number of Points
Below are visualizations showing how segmentation quality is maintained even with significantly fewer points. Each pair shows ground truth (top) and prediction (bottom) at different point counts:
(Image pairs: ground truth and prediction for the same object at each point count. Prediction accuracy: 90.45% at 10000 points, 90.39% at 5000, 90.22% at 2000, 89.74% at 1000, 88.55% at 500.)
Q4. Bonus Question - Locality (20 points)
Implemented a simplified PointNet++.
Model Implemented: PointNet++
PointNet++ addresses PointNet's limitation of lacking local structure modeling by:
- Hierarchical Feature Learning: Uses Set Abstraction (SA) layers that sample representative points and group local neighborhoods
- Local Aggregation: For each sampled point, aggregates features from k nearest neighbors using PointNet-style MLPs
- Multi-Scale Processing: Processes point clouds at multiple scales (e.g., 10000→512→128 points) to capture both local and global features
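The sample-and-group step that Set Abstraction relies on can be sketched in NumPy. This is illustrative only: the function names are ours, we use k-NN in place of the paper's ball query, and a real implementation would run FPS on the GPU:

```python
import numpy as np

def farthest_point_sampling(points, n_samples, seed=0):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set.
    O(n_samples * N) NumPy version; custom CUDA kernels do this much faster."""
    rng = np.random.default_rng(seed)
    idx = np.zeros(n_samples, dtype=int)
    idx[0] = rng.integers(len(points))
    dist = np.linalg.norm(points - points[idx[0]], axis=1)
    for i in range(1, n_samples):
        idx[i] = np.argmax(dist)  # farthest remaining point
        dist = np.minimum(dist, np.linalg.norm(points - points[idx[i]], axis=1))
    return idx

def knn_group(points, centers, k):
    """For each center, gather its k nearest neighbors in relative coordinates,
    giving (M, k, 3) local patches for a PointNet-style MLP to process."""
    d = np.linalg.norm(points[None, :, :] - centers[:, None, :], axis=2)  # (M, N)
    nn = np.argsort(d, axis=1)[:, :k]                                     # (M, k)
    return points[nn] - centers[:, None, :]  # centered local neighborhoods
```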
Architecture Details
Classification Model (PointNet++):
- SA1: Samples 512 centers, groups k=32 neighbors, outputs 128-dim features
- SA2: Samples 128 centers, groups k=32 neighbors, outputs 512-dim features
- Global max pooling + MLP classifier (512→256→128→3)
Segmentation Model (PointNet++):
- Per-point MLP: 3→64→64
- Local aggregation: k-NN (k=16) with relative coordinates
- Global feature concatenation: local (128-dim) + global (128-dim) → 256-dim
- Decoder: 256→256→128→6
Comparison Results
Classification Task
| Model | Accuracy | Improvement |
|---|---|---|
| PointNet | 98.22% | - |
| PointNet++ | 98.64% | +0.42 pp (+0.43% relative) |
Segmentation Task
| Model | Accuracy | Improvement |
|---|---|---|
| PointNet | 90.45% | - |
| PointNet++ | 88.44% | -2.01 pp (-2.22% relative) |
Analysis
Classification Results:
- PointNet++ achieves a modest improvement (+0.42%) over PointNet for classification. The hierarchical local feature learning helps capture more discriminative features, especially for objects with complex geometric structures.
- The improvement is relatively small because PointNet already performs very well (98.22%), leaving little room for improvement. However, PointNet++'s local aggregation may help with edge cases.
Segmentation Results:
- Our implementation of PointNet++ performs slightly worse (-2.01 pp) than PointNet for segmentation. This is unexpected and can be attributed to several factors:
- Simplified Architecture: The implemented PointNet++ segmentation model differs significantly from the original paper (Qi et al., 2017). While the classification model follows the paper's hierarchical Set Abstraction architecture, the segmentation model uses a simplified approach that does not implement the full feature propagation (FP) pipeline.
- Missing Components: The original PointNet++ segmentation uses an hourglass architecture: SA↓ → SA↓ → SA↓ → FP↑ → FP↑ → FP↑. Our implementation instead uses: per-point MLP → local k-NN aggregation → global feature → decoder. It is therefore missing:
  - Hierarchical downsampling with Set Abstraction layers
  - Feature Propagation (upsampling) layers with interpolation
  - Skip connections between encoder and decoder
  - Multi-scale grouping at different resolutions
- Why This Simplification? The full PointNet++ segmentation pipeline is complex to implement without custom CUDA kernels for ball query and efficient farthest point sampling. Our simplified version captures the spirit of locality through k-NN grouping but lacks the hierarchical structure.
- Training Issues: The more complex architecture with local neighborhoods requires more careful hyperparameter tuning, and the model may not have been fully optimized for this task.
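For reference, the core of the missing Feature Propagation step is conceptually simple: features at the sparse level are interpolated back onto the dense points with inverse-distance weights over the three nearest sparse neighbors. A NumPy sketch (function and argument names are ours; the real FP layer also concatenates skip features and applies an MLP afterward):

```python
import numpy as np

def fp_interpolate(xyz_dense, xyz_sparse, feat_sparse, k=3, eps=1e-8):
    """Interpolate (M, C) sparse-level features onto (N, 3) dense points
    using inverse-distance weights over each dense point's k nearest
    sparse neighbors. Returns an (N, C) feature array."""
    d = np.linalg.norm(xyz_dense[:, None, :] - xyz_sparse[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, :k]                    # (N, k) neighbor ids
    w = 1.0 / (np.take_along_axis(d, nn, axis=1) + eps)  # inverse-distance weights
    w = w / w.sum(axis=1, keepdims=True)                 # normalize per point
    return (feat_sparse[nn] * w[:, :, None]).sum(axis=1)
```

A dense point that coincides with a sparse point recovers (essentially) that point's feature, since its zero distance dominates the weighting.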
Key Takeaway: The classification PointNet++ follows the paper's architecture closely and shows improvement, while the segmentation PointNet++ is a simplified "local-enhanced PointNet" rather than true hierarchical PointNet++, explaining its lower performance.
Visualization Comparison: PointNet vs. PointNet++
Below are side-by-side comparisons showing qualitative differences between PointNet and PointNet++ segmentation:
(Image triplets for Objects 1, 3, and 4: ground-truth segmentation, PointNet prediction, and PointNet++ prediction.)