Assignment 5: Point Cloud Processing
Name: Simson D'Souza, Andrew ID: sjdsouza, Email: sjdsouza@andrew.cmu.edu
1. Classification Model (40 points)
Model Architecture
```python
class cls_model(nn.Module):
    def __init__(self, num_classes=3):
        super(cls_model, self).__init__()
        self.conv1 = nn.Conv1d(3, 64, 1)
        self.bn1 = nn.BatchNorm1d(64)
        self.conv2 = nn.Conv1d(64, 64, 1)
        self.bn2 = nn.BatchNorm1d(64)
        self.conv3 = nn.Conv1d(64, 128, 1)
        self.bn3 = nn.BatchNorm1d(128)
        self.conv4 = nn.Conv1d(128, 1024, 1)
        self.bn4 = nn.BatchNorm1d(1024)
        self.fc1 = nn.Linear(1024, 512)
        self.bn_fc1 = nn.BatchNorm1d(512)
        self.dropout1 = nn.Dropout(p=0.4)
        self.fc2 = nn.Linear(512, 256)
        self.bn_fc2 = nn.BatchNorm1d(256)
        self.dropout2 = nn.Dropout(p=0.4)
        self.fc3 = nn.Linear(256, num_classes)

    def forward(self, points):
        '''
        points: tensor of size (B, N, 3),
                where B is batch size and N is the number of points (default N=10000)
        output: tensor of size (B, num_classes)
        '''
        B, N, D = points.size()
        x = points.transpose(2, 1)
        x = F.relu(self.bn1(self.conv1(x)))    # (B, 64, N)
        x = F.relu(self.bn2(self.conv2(x)))    # (B, 64, N)
        x = F.relu(self.bn3(self.conv3(x)))    # (B, 128, N)
        x = F.relu(self.bn4(self.conv4(x)))    # (B, 1024, N)
        x = torch.max(x, 2, keepdim=False)[0]  # global feature (B, 1024)
        x = F.relu(self.bn_fc1(self.fc1(x)))
        x = self.dropout1(x)
        x = F.relu(self.bn_fc2(self.fc2(x)))
        x = self.dropout2(x)
        output = self.fc3(x)
        return output
```
Model Training Hyperparameters
- Optimizer: Adam
- Learning Rate: 0.001
- Epochs: 250
- Batch Size: 32
Test Accuracy: 0.9800629590766002
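For concreteness, a minimal sketch of how these hyperparameters could be wired into a training loop; `train_dataloader` and the label format are assumptions for illustration, not the actual starter code.

```python
import torch
import torch.nn as nn

# Hypothetical training setup matching the hyperparameters listed above.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = cls_model(num_classes=3).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()  # cls_model returns raw logits

for epoch in range(250):
    for points, labels in train_dataloader:  # assumed: points (B, N, 3), labels (B,)
        points, labels = points.to(device), labels.to(device)
        optimizer.zero_grad()
        logits = model(points)               # (B, num_classes)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
```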
Successful Predictions
| Point Cloud Visualization | Ground Truth | Prediction |
|---|---|---|
|  | Chair | Chair |
|  | Lamp | Lamp |
|  | Vase | Vase |
Failure Predictions
| Point Cloud Visualization | Ground Truth | Prediction |
|---|---|---|
|  | Chair | Lamp |
|  | Lamp | Vase |
|  | Vase | Lamp |
Interpretation
- The PointNet model demonstrates high accuracy for objects exhibiting canonical shapes. In these successful cases, the Max Pooling layer effectively extracts a distinctive global feature vector that strongly matches the target class (a quick sanity check of this pooling step's permutation invariance is sketched after this list).
- Failures often occur when an object possesses features commonly found in another class. For instance, the Chair (bar stool) was misclassified as a Lamp due to its simplified geometry, emphasizing verticality and cylindrical symmetry, which aligns with the features of many lamps.
- The model shows sensitivity to global symmetry. Confusion between Lamp and Vase occurred when one object exhibited the primary geometric feature of the other (e.g., a bulky, cylindrical lamp base resembling a wide vase).
- The introduction of complex, external features or "noise" (like the plant structure within the Vase point cloud) disrupted the overall global feature extraction. This resulted in a failure, as the added detail skewed the global descriptor, leading to misclassification as a more intricate object like a Lamp.
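Since the interpretation above leans on the max-pooled global feature, here is a small sanity check showing that the pooled descriptor, and hence the prediction, is unaffected by point ordering. It assumes a trained `model` (the cls_model above in eval mode) and a `(1, N, 3)` tensor `points`; these names are placeholders, not from the starter code.

```python
import torch

# Shuffling the point order should leave the logits unchanged, because the
# per-point layers are applied independently and torch.max over the point
# dimension discards ordering entirely.
model.eval()
with torch.no_grad():
    perm = torch.randperm(points.size(1))
    logits_original = model(points)
    logits_shuffled = model(points[:, perm, :])
print(torch.allclose(logits_original, logits_shuffled, atol=1e-5))  # expected: True
```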
2. Segmentation Model (40 points)
Model Architecture
```python
class seg_model(nn.Module):
    def __init__(self, num_seg_classes=6):
        super(seg_model, self).__init__()
        self.conv1 = nn.Conv1d(3, 64, 1)
        self.bn1 = nn.BatchNorm1d(64)
        self.conv2 = nn.Conv1d(64, 64, 1)
        self.bn2 = nn.BatchNorm1d(64)
        self.conv3 = nn.Conv1d(64, 128, 1)
        self.bn3 = nn.BatchNorm1d(128)
        self.conv4 = nn.Conv1d(128, 1024, 1)
        self.bn4 = nn.BatchNorm1d(1024)
        self.seg_conv1 = nn.Conv1d(1024 + 128, 512, 1)
        self.seg_bn1 = nn.BatchNorm1d(512)
        self.seg_conv2 = nn.Conv1d(512, 256, 1)
        self.seg_bn2 = nn.BatchNorm1d(256)
        self.seg_conv3 = nn.Conv1d(256, 128, 1)
        self.seg_bn3 = nn.BatchNorm1d(128)
        self.seg_conv4 = nn.Conv1d(128, num_seg_classes, 1)

    def forward(self, points):
        '''
        points: tensor of size (B, N, 3),
                where B is batch size and N is the number of points per object (N=10000 by default)
        output: tensor of size (B, N, num_seg_classes)
        '''
        B, N, D = points.size()
        x = points.transpose(2, 1)
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        local_features = F.relu(self.bn3(self.conv3(x)))           # (B, 128, N)
        x = F.relu(self.bn4(self.conv4(local_features)))           # (B, 1024, N)
        global_feature = torch.max(x, 2, keepdim=True)[0]          # (B, 1024, 1)
        global_feature_expanded = global_feature.repeat(1, 1, N)   # (B, 1024, N)
        concat_features = torch.cat([global_feature_expanded, local_features], 1)  # (B, 1152, N)
        # Segmentation head (per-point prediction)
        x = F.relu(self.seg_bn1(self.seg_conv1(concat_features)))
        x = F.relu(self.seg_bn2(self.seg_conv2(x)))
        x = F.relu(self.seg_bn3(self.seg_conv3(x)))
        output = self.seg_conv4(x)        # (B, num_seg_classes, N)
        output = output.transpose(2, 1)   # (B, N, num_seg_classes)
        return output
```
Model Training Hyperparameters
- Optimizer: Adam
- Learning Rate: 0.001
- Epochs: 250
- Batch Size: 32
Test Accuracy: 0.9044588330632091
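The reported number is per-point accuracy. A minimal sketch of how it can be computed, assuming `model` is the trained seg_model, `points` is a `(B, N, 3)` test batch, and `labels` holds per-point ground truth of shape `(B, N)` (variable names are placeholders):

```python
import torch

# Per-point segmentation accuracy: argmax over the class dimension, then
# average the fraction of points whose predicted label matches ground truth.
model.eval()
with torch.no_grad():
    logits = model(points)               # (B, N, num_seg_classes)
    pred_labels = logits.argmax(dim=-1)  # (B, N)
    accuracy = (pred_labels == labels).float().mean().item()
print(f"per-point accuracy: {accuracy:.4f}")
```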
Segmentation Results of 6 Objects
Successful Predictions with better accuracy
| Ground Truth | Prediction | Test Accuracy |
|---|---|---|
| ![]() | ![]() | 0.946 |
| ![]() | ![]() | 0.987 |
| ![]() | ![]() | 0.955 |
Failure Predictions with lower accuracy
| Ground Truth | Prediction | Test Accuracy |
|---|---|---|
| ![]() | ![]() | 0.531 |
| ![]() | ![]() | 0.594 |
| ![]() | ![]() | 0.548 |
Interpretation
- The model performs exceptionally well on chairs with clear geometric separation between parts (back, seat, legs), as seen in the examples above with accuracies of 0.946, 0.987, and 0.955. The key to this success is that PointNet uses the global feature (context of the whole chair) to reinforce the local features of each point, allowing it to accurately delineate edges, such as the thin boundary between the red seat and the blue legs/frame.
- The most common failure mode is poor definition in objects where components are merged (like soft seating) or have rounded edges. The lack of sharp geometric distinction makes it nearly impossible for the per-point MLPs to accurately assign semantic boundaries, leading to significant bleeding between segments.
- In the third failure (Accuracy: 0.548), the model appears to fundamentally misclassify the base structure. The Ground Truth shows clear segment layers (e.g., base/seat), but the prediction replaces a large vertical section with a single color (Blue). This suggests the model failed to combine local features with the global context correctly, likely interpreting the lower portion as a single, uniform volume rather than distinct chair segments.
- The low accuracy confirms that when local feature clues are ambiguous due to complex geometry (like thick cushions, round surfaces), the network's reliance on the global feature vector is insufficient to distinguish boundaries at the point level, resulting in volumetric misassignments.
3. Robustness Analysis (20 points)
Experiment 1: Rotate input point clouds by certain degrees
Procedure: To evaluate the model's invariance to orientation, I tested the pre-trained models on the standard test dataset while applying a rotation transformation to the input point clouds. I systematically rotated each object around the vertical axis (z-axis) with angles ranging from $10^{\circ}$ to $180^{\circ}$ and recorded the drop in accuracy for both classification and segmentation tasks.
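A minimal sketch of the rotation transform, assuming the clouds are `(B, N, 3)` tensors with z as the vertical axis; the helper name is illustrative, not taken from the evaluation script.

```python
import math
import torch

def rotate_z(points: torch.Tensor, degrees: float) -> torch.Tensor:
    """Rotate a (B, N, 3) point cloud about the z-axis by the given angle."""
    theta = math.radians(degrees)
    c, s = math.cos(theta), math.sin(theta)
    rot = torch.tensor([[c, -s, 0.0],
                        [s,  c, 0.0],
                        [0.0, 0.0, 1.0]], dtype=points.dtype, device=points.device)
    return points @ rot.T  # applies the same rotation to every point

# Example: evaluate the pre-trained models on clouds rotated by 40 degrees.
rotated = rotate_z(points, 40.0)  # `points` is an assumed (B, N, 3) test batch
```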
Classification Results
The table shows the accuracy comparison relative to the baseline (Q1).
| Rotation (Degree) | 0 (Baseline) | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 | 110 | 120 | 130 | 140 | 150 | 160 | 170 | 180 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Test Accuracy | 0.980 (Baseline) | 0.969 | 0.912 | 0.823 | 0.686 | 0.475 | 0.285 | 0.288 | 0.376 | 0.459 | 0.493 | 0.492 | 0.487 | 0.478 | 0.512 | 0.601 | 0.673 | 0.696 | 0.642 |
The following are a few visualizations (including both successful and failed predictions):
| Rotation (Degree) | Ground Truth | Prediction | Point Cloud Visualization |
|---|---|---|---|
| 10 | Chair | Chair | ![]() |
| 40 | Chair | Vase | ![]() |
| 70 | Lamp | Chair | ![]() |
Segmentation Results
The table shows the accuracy comparison relative to the baseline (Q2).
| Rotation (Degree) | 0 (Baseline) | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 | 110 | 120 | 130 | 140 | 150 | 160 | 170 | 180 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Test Accuracy | 0.904 (Baseline) | 0.887 | 0.846 | 0.794 | 0.735 | 0.663 | 0.569 | 0.447 | 0.322 | 0.227 | 0.167 | 0.133 | 0.124 | 0.137 | 0.169 | 0.224 | 0.292 | 0.344 | 0.359 |
The following are a few visualizations (failure cases):
| Rotation (Degree) | Ground Truth | Prediction | Test Accuracy |
|---|---|---|---|
| 30 | ![]() | ![]() | 0.361 |
| 100 | ![]() | ![]() | 0.167 |
| 180 | ![]() | ![]() | 0.359 |
Interpretation
- Classification: The model lacks rotation invariance and is highly sensitive to the orientation of the input data. As shown in the results, accuracy is high at small perturbations ($10^{\circ}$) but plummets drastically as rotation approaches $60^{\circ}-90^{\circ}$ (dropping from $\sim97\%$ to $\sim28\%$). This indicates that the PointNet architecture relies heavily on the canonical alignment of the training set and fails to generalize when absolute coordinates shift.
- Segmentation: Similar to classification, segmentation performance degrades rapidly as rotation increases, dropping from $\sim88\%$ accuracy at $10^{\circ}$ to $\sim12\%$ at $120^{\circ}$. Because the model learns to associate specific parts with specific spatial locations, rotating the object disrupts these learned spatial priors.
Experiment 2: Different number of points per object
Procedure: To analyze how dependent the model is on high-resolution data, I varied the number of points sampled from the test objects during inference. While the model was trained on 10,000 points, I evaluated it on inputs ranging from extremely sparse clouds (10 points) to dense clouds (9,000 points) to identify the critical sampling density required for accurate performance.
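A minimal sketch of the subsampling step, assuming each test cloud is a `(B, 10000, 3)` tensor; the helper is illustrative rather than the exact evaluation code.

```python
import torch

def subsample(points: torch.Tensor, num_points: int) -> torch.Tensor:
    """Randomly keep `num_points` of the N points in each (B, N, 3) cloud."""
    idx = torch.randperm(points.size(1))[:num_points]
    return points[:, idx, :]

# Example: evaluate on sparse clouds of 100 points per object.
sparse = subsample(points, 100)  # `points` is an assumed (B, 10000, 3) test batch
```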
Classification Results
The table shows the accuracy comparison relative to the baseline (Q1).
| Number of Points per object | 10 | 100 | 1000 | 2500 | 5000 | 7500 | 9000 | 10000 (Baseline) |
|---|---|---|---|---|---|---|---|---|
| Test Accuracy | 0.257 | 0.927 | 0.972 | 0.979 | 0.979 | 0.979 | 0.980 | 0.980 (Baseline) |
The following are a few visualizations (failure cases):
| Number of Points per Object | Ground Truth | Prediction | Point Cloud Visualization |
|---|---|---|---|
| 10 | Chair | Lamp | ![]() |
| 1000 | Chair | Vase | ![]() |
| 7500 | Lamp | Chair | ![]() |
Segmentation Results
The table shows the accuracy comparison relative to the baseline (Q2).
| Number of Points per object | 10 | 100 | 1000 | 2500 | 5000 | 7500 | 9000 | 10000 (Baseline) |
|---|---|---|---|---|---|---|---|---|
| Test Accuracy | 0.614 | 0.829 | 0.904 | 0.905 | 0.905 | 0.905 | 0.906 | 0.906 (Baseline) |
The following are a few visualizations (including failure cases):
| Number of Points per Object | Ground Truth | Prediction | Test Accuracy |
|---|---|---|---|
| 10 | ![]() | ![]() | 0.3 |
| 2500 | ![]() | ![]() | 0.464 |
| 7500 | ![]() | ![]() | 0.478 |
Interpretation
- Classification: The model demonstrates remarkable robustness to varying point densities. Accuracy saturates early, around 1,000 points ($\sim97\%$), implying that the PointNet architecture effectively captures the global shape using only a sparse "critical set" of points. Performance only collapses at extremely low densities (10 points, $\sim25\%$), where the point cloud becomes too sparse to represent the object's underlying geometry, rendering it unrecognizable.
- Segmentation: Similar to classification, segmentation performance remains stable and high (>90%) once the point count exceeds 1,000, with negligible gains from adding more points (up to 9,000). The model remains surprisingly effective even at 100 points (83%), suggesting it relies on coarse structural cues rather than fine-grained density. However, at 10 points, accuracy drops significantly because there is insufficient local context to distinguish between different object parts.
4. Bonus Question - Locality (20 points)
Model Implemented
- I have implemented a Transformer-based Point Encoder for both classification and segmentation.
- Classification Model: Input points are projected to embeddings with added sinusoidal positional encodings. The core architecture uses a Transformer Encoder layer with Multi-Head Self-Attention to capture relationships between points. Features are aggregated via Max Pooling into a global vector and passed through an MLP head for final classification.
- Segmentation Model: Utilizes the same embedding and Transformer Encoder backbone to enrich point features with context via self-attention. Unlike the classification model, it skips global pooling and applies a 1D convolutional head directly to the sequence of encoded features to generate per-point segmentation masks.
- This satisfies the locality requirement by using Self-Attention, which allows the model to aggregate information from neighboring points dynamically, rather than processing each point in isolation like PointNet.
Both models are trained with the same hyperparameters as the baselines:
- Optimizer: Adam
- Learning Rate: 0.001
- Epochs: 250
- Batch Size: 32
Classification Model Architecture
```python
class PositionalEncoding(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.dim = dim

    def forward(self, x):
        device = x.device
        N = x.size(1)
        pos = torch.arange(N, dtype=torch.float32, device=device).unsqueeze(1)
        dim = torch.arange(self.dim, device=device).float()
        angle = pos / (10000 ** (2 * (dim // 2) / self.dim))
        pe = torch.zeros(N, self.dim, device=device)
        pe[:, 0::2] = torch.sin(angle[:, 0::2])
        pe[:, 1::2] = torch.cos(angle[:, 1::2])
        return pe.unsqueeze(0)


class cls_model_transformer(nn.Module):
    def __init__(self, num_classes=3, d_model=32, nhead=2, num_layers=1):
        super().__init__()
        self.embedding = nn.Linear(3, d_model)
        self.pos_encoding = PositionalEncoding(d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dropout=0.1, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers)
        self.head = nn.Sequential(
            nn.Linear(d_model, 64),
            nn.BatchNorm1d(64),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
            nn.Linear(64, num_classes)
        )

    def forward(self, points, return_logits=False):
        x = self.embedding(points) + self.pos_encoding(points)
        x = self.transformer(x)
        x = x.max(dim=1).values
        x = self.head(x)
        if return_logits:
            return x
        return F.log_softmax(x, dim=1)
```
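As a quick sanity check of the tensor flow (not part of training), the model can be run on a small random batch; the point count here is reduced from the dataset's 10,000 only to keep the self-attention matrix small in this sketch.

```python
import torch

# Dummy forward pass: 2 random clouds of 1024 points each.
model = cls_model_transformer(num_classes=3, d_model=32, nhead=2, num_layers=1)
model.eval()
with torch.no_grad():
    dummy = torch.rand(2, 1024, 3)
    out = model(dummy)   # log-probabilities over the 3 classes
print(out.shape)         # torch.Size([2, 3])
```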
Test Accuracy: 0.988 (an improvement of 0.82% in test accuracy over the PointNet baseline)
Model Performance Comparison
| Model | Test Accuracy |
|---|---|
| PointNet (Baseline) | 0.980 |
| Transformer | 0.988 |
Classification Visualizations
The following visualizations show objects that were misclassified by the PointNet model but correctly identified by the transformer model.
| Point Cloud Visualization | Ground Truth | PointNet (Baseline) Model Prediction (Failure) | Transformer Model Prediction (Success) |
|---|---|---|---|
| ![]() | Chair | Lamp | Chair |
| ![]() | Lamp | Vase | Lamp |
| ![]() | Vase | Lamp | Vase |
Segmentation Model Architecture
```python
class seg_model_transformer(nn.Module):
    def __init__(self, num_seg_classes=6, d_model=32, nhead=2, num_layers=1):
        super().__init__()
        self.embedding = nn.Linear(3, d_model)
        self.pos_encoding = PositionalEncoding(d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dropout=0.1, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers)
        self.conv_head = nn.Sequential(
            nn.Conv1d(d_model, 64, 1),
            nn.BatchNorm1d(64),
            nn.ReLU(inplace=True),
            nn.Conv1d(64, num_seg_classes, 1)
        )

    def forward(self, points, return_logits=False):
        # points: (B, N, 3)
        x = self.embedding(points) + self.pos_encoding(points)
        x = self.transformer(x)
        x = x.transpose(1, 2)
        x = self.conv_head(x)
        x = x.transpose(1, 2)
        if return_logits:
            return x
        return F.log_softmax(x, dim=2)
```
Test Accuracy: 0.923 (an improvement of 2.10% in test accuracy over the PointNet baseline)
Model Performance Comparison
| Model | Test Accuracy |
|---|---|
| PointNet (Baseline) | 0.904 |
| Transformer | 0.923 |
Segmentation Visualizations
The following visualizations show objects for which the PointNet model achieved lower test accuracy, while the Transformer model performed better.
| Ground Truth | PointNet (Baseline) Model Visualization | Transformer Model Visualization | PointNet (Baseline) Model Test Accuracy | Transformer Model Test Accuracy |
|---|---|---|---|---|
| ![]() | ![]() | ![]() | 0.409 | 0.892 |
| ![]() | ![]() | ![]() | 0.338 | 0.876 |
| ![]() | ![]() | ![]() | 0.464 | 0.923 |
Interpretation
- The Transformer-based implementation achieved slightly higher test accuracy than the baseline PointNet model on both tasks. This gain highlights the benefit of the self-attention mechanism, which captures local context and relationships between points that the standard PointNet architecture, processing each point independently before global pooling, does not model.