Assignment 5: Point Cloud Processing¶

Name: Simson D'Souza, Andrew ID: sjdsouza, Email: sjdsouza@andrew.cmu.edu¶


1. Classification Model (40 points)¶

Model Architecture


import torch
import torch.nn as nn
import torch.nn.functional as F

class cls_model(nn.Module):
    def __init__(self, num_classes=3):
        super(cls_model, self).__init__()
        # Shared per-point MLP implemented as 1x1 convolutions
        self.conv1 = nn.Conv1d(3, 64, 1)
        self.bn1 = nn.BatchNorm1d(64)
        self.conv2 = nn.Conv1d(64, 64, 1)
        self.bn2 = nn.BatchNorm1d(64)
        self.conv3 = nn.Conv1d(64, 128, 1)
        self.bn3 = nn.BatchNorm1d(128)
        self.conv4 = nn.Conv1d(128, 1024, 1)
        self.bn4 = nn.BatchNorm1d(1024)

        # Classification head on the global feature
        self.fc1 = nn.Linear(1024, 512)
        self.bn_fc1 = nn.BatchNorm1d(512)
        self.dropout1 = nn.Dropout(p=0.4)

        self.fc2 = nn.Linear(512, 256)
        self.bn_fc2 = nn.BatchNorm1d(256)
        self.dropout2 = nn.Dropout(p=0.4)

        self.fc3 = nn.Linear(256, num_classes)

    def forward(self, points):
        '''
        points: tensor of size (B, N, 3)
                where B is batch size and N is the number of points (default N=10000)
        output: tensor of size (B, num_classes)
        '''
        B, N, D = points.size()

        x = points.transpose(2, 1)  # (B, 3, N)

        x = F.relu(self.bn1(self.conv1(x)))  # (B, 64, N)
        x = F.relu(self.bn2(self.conv2(x)))  # (B, 64, N)
        x = F.relu(self.bn3(self.conv3(x)))  # (B, 128, N)
        x = F.relu(self.bn4(self.conv4(x)))  # (B, 1024, N)

        # Symmetric max pooling over points -> global feature (B, 1024)
        x = torch.max(x, 2, keepdim=False)[0]

        x = F.relu(self.bn_fc1(self.fc1(x)))
        x = self.dropout1(x)

        x = F.relu(self.bn_fc2(self.fc2(x)))
        x = self.dropout2(x)

        output = self.fc3(x)
        return output




Model Training Hyperparameters

  • Optimizer: Adam
  • Learning Rate: 0.001
  • Epochs: 250
  • Batch Size: 32
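
For reference, a minimal training-loop sketch using these settings is shown below; train_loader and device are assumed names standing in for the assignment's dataloader and device setup, not the exact starter code.

# Training-loop sketch with the hyperparameters above
# (`train_loader` and `device` are assumed names, not the exact starter code).
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = cls_model(num_classes=3).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

for epoch in range(250):
    model.train()
    for points, labels in train_loader:   # points: (32, N, 3), labels: (32,)
        points, labels = points.to(device), labels.to(device)
        optimizer.zero_grad()
        logits = model(points)            # (32, num_classes)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()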

Test Accuracy: 0.9800629590766002

Successful Predictions

Point Cloud Visualization   Ground Truth   Prediction
[point cloud image]         Chair          Chair
[point cloud image]         Lamp           Lamp
[point cloud image]         Vase           Vase



Failed Predictions

Point Cloud Visualization   Ground Truth   Prediction
[point cloud image]         Chair          Lamp
[point cloud image]         Lamp           Vase
[point cloud image]         Vase           Lamp




Interpretation

  • The PointNet model demonstrates high accuracy for objects exhibiting canonical shapes. In these successful cases, the Max Pooling layer effectively extracts a unique global feature vector that strongly matches the target class.
  • Failures often occur when an object possesses features commonly found in another class. For instance, the Chair (bar stool) was misclassified as a Lamp due to its simplified geometry, emphasizing verticality and cylindrical symmetry, which aligns with the features of many lamps.
  • The model shows sensitivity to global symmetry. Confusion between Lamp and Vase occurred when one object exhibited the primary geometric feature of the other (e.g., a bulky, cylindrical lamp base resembling a wide vase).
  • The introduction of complex, external features or "noise" (like the plant structure within the Vase point cloud) disrupted the overall global feature extraction. This resulted in a failure, as the added detail skewed the global descriptor, leading to misclassification as a more intricate object like a Lamp.
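
To make the point about the max-pooling bottleneck concrete, here is a small sanity-check sketch (my own illustration, not part of the assignment code) showing that the classifier's output is unchanged when the input points are re-ordered:

# Illustrative check: max pooling over points makes the classifier
# invariant to the ordering of the input points.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = cls_model(num_classes=3).to(device).eval()   # invariance holds whether trained or not

with torch.no_grad():
    pts = torch.rand(1, 10000, 3, device=device)        # dummy point cloud
    perm = torch.randperm(pts.size(1), device=device)   # random re-ordering of the points
    print(torch.allclose(model(pts), model(pts[:, perm, :]), atol=1e-5))  # expected: True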

2. Segmentation Model (40 points)¶

Model Architecture

class seg_model(nn.Module):
    def __init__(self, num_seg_classes=6):
        super(seg_model, self).__init__()
        self.conv1 = nn.Conv1d(3, 64, 1)
        self.bn1 = nn.BatchNorm1d(64)

        self.conv2 = nn.Conv1d(64, 64, 1)
        self.bn2 = nn.BatchNorm1d(64)

        self.conv3 = nn.Conv1d(64, 128, 1)
        self.bn3 = nn.BatchNorm1d(128)

        self.conv4 = nn.Conv1d(128, 1024, 1)
        self.bn4 = nn.BatchNorm1d(1024)

        self.seg_conv1 = nn.Conv1d(1024 + 128, 512, 1)
        self.seg_bn1 = nn.BatchNorm1d(512)

        self.seg_conv2 = nn.Conv1d(512, 256, 1)
        self.seg_bn2 = nn.BatchNorm1d(256)

        self.seg_conv3 = nn.Conv1d(256, 128, 1)
        self.seg_bn3 = nn.BatchNorm1d(128)

        self.seg_conv4 = nn.Conv1d(128, num_seg_classes, 1)

    def forward(self, points):
        '''
        points: tensor of size (B, N, 3)
                where B is batch size and N is the number of points per object (N=10000 by default)
        output: tensor of size (B, N, num_seg_classes)
        '''
        B, N, D = points.size()
        x = points.transpose(2, 1)  # (B, 3, N)

        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        local_features = F.relu(self.bn3(self.conv3(x)))   # (B, 128, N)

        x = F.relu(self.bn4(self.conv4(local_features)))   # (B, 1024, N)

        global_feature = torch.max(x, 2, keepdim=True)[0]  # (B, 1024, 1)

        global_feature_expanded = global_feature.repeat(1, 1, N)  # (B, 1024, N)

        concat_features = torch.cat([global_feature_expanded, local_features], 1)  # (B, 1152, N)

        # Segmentation head (per-point prediction)
        x = F.relu(self.seg_bn1(self.seg_conv1(concat_features)))
        x = F.relu(self.seg_bn2(self.seg_conv2(x)))
        x = F.relu(self.seg_bn3(self.seg_conv3(x)))

        output = self.seg_conv4(x)        # (B, num_seg_classes, N)
        output = output.transpose(2, 1)   # (B, N, num_seg_classes)

        return output




Model Training Hyperparameters

  • Optimizer: Adam
  • Learning Rate: 0.001
  • Epochs: 250
  • Batch Size: 32

Test Accuracy: 0.9044588330632091
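
For reference, a sketch of how the reported per-point accuracy can be computed; test_loader, model, and device are assumed names standing in for the actual evaluation script.

# Per-point segmentation accuracy sketch (assumed names, not the exact eval script).
import torch

model.eval()
correct, total = 0, 0
with torch.no_grad():
    for points, labels in test_loader:      # points: (B, N, 3), labels: (B, N)
        preds = model(points.to(device)).argmax(dim=-1).cpu()   # (B, N) predicted part labels
        correct += (preds == labels).sum().item()
        total += labels.numel()
print(correct / total)                      # overall per-point accuracy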

Segmentation Results of 6 Objects

Successful Predictions (higher accuracy)

Ground Truth           Prediction             Test Accuracy
[segmentation image]   [segmentation image]   0.946
[segmentation image]   [segmentation image]   0.987
[segmentation image]   [segmentation image]   0.955




Failed Predictions (lower accuracy)

Ground Truth           Prediction             Test Accuracy
[segmentation image]   [segmentation image]   0.531
[segmentation image]   [segmentation image]   0.594
[segmentation image]   [segmentation image]   0.548

Interpretation

  • The model performs exceptionally well on chairs with clear geometric separation between parts (back, seat, legs), as seen in the examples above with accuracies of 0.946, 0.987, and 0.955. The key to this success is that PointNet uses the global feature (the context of the whole chair) to reinforce each point's local features, allowing it to accurately delineate boundaries such as the thin edge between the red seat and the blue legs/frame.
  • The most common failure mode is poor definition in objects where components are merged (like soft seating) or have rounded edges. The lack of sharp geometric distinction makes it nearly impossible for the per-point MLPs to accurately assign semantic boundaries, leading to significant bleeding between segments.
  • In the third failure (Accuracy: 0.548), the model appears to fundamentally misclassify the base structure. The Ground Truth shows clear segment layers (e.g., base/seat), but the prediction replaces a large vertical section with a single color (Blue). This suggests the model failed to combine local features with the global context correctly, likely interpreting the lower portion as a single, uniform volume rather than distinct chair segments.
  • The low accuracy confirms that when local feature clues are ambiguous due to complex geometry (like thick cushions, round surfaces), the network's reliance on the global feature vector is insufficient to distinguish boundaries at the point level, resulting in volumetric misassignments.

3. Robustness Analysis (20 points)¶

Experiment 1: Rotating input point clouds by varying degrees¶

Procedure: To evaluate the model's invariance to orientation, I tested the pre-trained models on the standard test dataset while applying a rotation transformation to the input point clouds. I systematically rotated each object around the vertical axis (z-axis) with angles ranging from $10^{\circ}$ to $180^{\circ}$ and recorded the drop in accuracy for both classification and segmentation tasks.
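
A sketch of the rotation applied in this experiment is shown below; rotate_z is a hypothetical helper name used for illustration, and the evaluation simply feeds the rotated clouds to the trained models.

# Illustrative sketch: rotate each point cloud about the z-axis by `angle_deg`
# before evaluation (`rotate_z` is a hypothetical helper, not the exact code used).
import math
import torch

def rotate_z(points: torch.Tensor, angle_deg: float) -> torch.Tensor:
    """points: (B, N, 3) -> rotated points of the same shape."""
    theta = math.radians(angle_deg)
    c, s = math.cos(theta), math.sin(theta)
    R = torch.tensor([[c, -s, 0.0],
                      [s,  c, 0.0],
                      [0.0, 0.0, 1.0]], dtype=points.dtype, device=points.device)
    return points @ R.T   # apply the rotation to every point

# e.g. evaluate at a 40-degree rotation:
# logits = model(rotate_z(points, 40.0))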

Classification Results

The table shows the accuracy comparison relative to the baseline (Q1).

Rotation (deg)   0 (Baseline)  10     20     30     40     50     60     70     80     90
Test Accuracy    0.980         0.969  0.912  0.823  0.686  0.475  0.285  0.288  0.376  0.459

Rotation (deg)   100    110    120    130    140    150    160    170    180
Test Accuracy    0.493  0.492  0.487  0.478  0.512  0.601  0.673  0.696  0.642




A few visualizations (including both successful and failed predictions) are shown below:

Rotation (deg)   Ground Truth   Prediction   Point Cloud Visualization
10               Chair          Chair        [point cloud image]
40               Chair          Vase         [point cloud image]
70               Lamp           Chair        [point cloud image]




Segmentation Results

The table shows the accuracy comparison relative to the baseline (Q2).

Rotation (deg)   0 (Baseline)  10     20     30     40     50     60     70     80     90
Test Accuracy    0.904         0.887  0.846  0.794  0.735  0.663  0.569  0.447  0.322  0.227

Rotation (deg)   100    110    120    130    140    150    160    170    180
Test Accuracy    0.167  0.133  0.124  0.137  0.169  0.224  0.292  0.344  0.359




A few failure-case visualizations are shown below:

Rotation (deg)   Ground Truth           Prediction             Test Accuracy
30               [segmentation image]   [segmentation image]   0.361
100              [segmentation image]   [segmentation image]   0.167
180              [segmentation image]   [segmentation image]   0.359




Interpretation

  • Classification: The model lacks rotation invariance and is highly sensitive to the orientation of the input data. As shown in the results, accuracy is high at small perturbations ($10^{\circ}$) but plummets drastically as rotation approaches $60^{\circ}-90^{\circ}$ (dropping from $\sim97\%$ to $\sim28\%$). This indicates that the PointNet architecture relies heavily on the canonical alignment of the training set and fails to generalize when absolute coordinates shift.
  • Segmentation: Similar to classification, segmentation performance degrades rapidly as rotation increases, dropping from $\sim88\%$ accuracy at $10^{\circ}$ to $\sim12\%$ at $120^{\circ}$. Because the model learns to associate specific parts with specific spatial locations, rotating the object disrupts these learned spatial priors.

Experiment 2: Varying the number of points per object¶

Procedure: To analyze how dependent the model is on high-resolution data, I varied the number of points sampled from the test objects during inference. While the model was trained on 10,000 points, I evaluated it on inputs ranging from extremely sparse clouds (10 points) to dense clouds (9,000 points) to identify the critical sampling density required for accurate performance.
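
A sketch of the subsampling used at inference time is shown below; sample_points is a hypothetical helper name used for illustration.

# Illustrative sketch: randomly subsample each object to `num_points` before evaluation
# (`sample_points` is a hypothetical helper, not the exact code used).
import torch

def sample_points(points: torch.Tensor, num_points: int) -> torch.Tensor:
    """points: (B, N, 3) -> (B, num_points, 3), sampled without replacement."""
    idx = torch.randperm(points.size(1), device=points.device)[:num_points]
    return points[:, idx, :]

# e.g. evaluate the classifier with only 100 points per object:
# logits = model(sample_points(points, 100))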

Classification Results

The table shows the accuracy comparison relative to the baseline (Q1).

Number of Points per Object   10     100    1000   2500   5000   7500   9000   10000 (Baseline)
Test Accuracy                 0.257  0.927  0.972  0.979  0.979  0.979  0.980  0.980

A few failure-case visualizations are shown below:

Number of Points per Object   Ground Truth   Prediction   Point Cloud Visualization
10                            Chair          Lamp         [point cloud image]
1000                          Chair          Vase         [point cloud image]
7500                          Lamp           Chair        [point cloud image]




Segmentation Results

The table shows the accuracy comparison relative to the baseline (Q2).

Number of Points per Object   10     100    1000   2500   5000   7500   9000   10000 (Baseline)
Test Accuracy                 0.614  0.829  0.904  0.905  0.905  0.905  0.906  0.906




A few failure-case visualizations are shown below:

Number of Points per Object   Ground Truth           Prediction             Test Accuracy
10                            [segmentation image]   [segmentation image]   0.3
2500                          [segmentation image]   [segmentation image]   0.464
7500                          [segmentation image]   [segmentation image]   0.478




Interpretation

  • Classification: The model demonstrates remarkable robustness to varying point densities. Accuracy saturates early around 1,000 points (~97%), implying that the PointNet architecture effectively captures the global shape using only a sparse "critical set" of points. The performance only collapses at extremely low densities (10 points, ~25%), where the point cloud becomes too sparse to represent the object's underlying geometry, rendering it unrecognizable.
  • Segmentation: Similar to classification, segmentation performance remains stable and high (>90%) once the point count exceeds 1,000, with negligible gains from adding more points (up to 9,000). The model remains surprisingly effective even at 100 points (83%), suggesting it relies on rough structural cues rather than fine-grained density. However, at 10 points, the accuracy drops significantly because there is insufficient local context to distinguish between different object parts.

4. Bonus Question - Locality (20 points)¶

Models Implemented

  • I implemented a Transformer-based point encoder for both classification and segmentation.
  • Classification Model: Input points are projected to embeddings with added sinusoidal positional encodings. The core architecture uses a Transformer Encoder layer with Multi-Head Self-Attention to capture relationships between points. Features are aggregated via Max Pooling into a global vector and passed through an MLP head for final classification.
  • Segmentation Model: Uses the same embedding and Transformer Encoder backbone to enrich point features with context via self-attention. Unlike the classification model, it skips global pooling and applies a 1D convolutional head directly to the sequence of encoded features to generate per-point segmentation masks.
  • This satisfies the locality requirement by using Self-Attention, which allows the model to aggregate information from neighboring points dynamically, rather than processing each point in isolation like PointNet.

Both models were trained with the same hyperparameters as the baseline:

  • Optimizer: Adam
  • Learning Rate: 0.001
  • Epochs: 250
  • Batch Size: 32

Classification Model Architecture

class PositionalEncoding(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.dim = dim

    def forward(self, x):
        device = x.device
        N = x.size(1)

        pos = torch.arange(N, dtype=torch.float32, device=device).unsqueeze(1)
        dim = torch.arange(self.dim, device=device).float()

        angle = pos / (10000 ** (2 * (dim // 2) / self.dim))
        pe = torch.zeros(N, self.dim, device=device)
        pe[:, 0::2] = torch.sin(angle[:, 0::2])
        pe[:, 1::2] = torch.cos(angle[:, 1::2])

        return pe.unsqueeze(0)

class cls_model_transformer(nn.Module):
    def __init__(self, num_classes=3, d_model=32, nhead=2, num_layers=1):
        super().__init__()

        self.embedding = nn.Linear(3, d_model)
        self.pos_encoding = PositionalEncoding(d_model)

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dropout=0.1, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers)

        self.head = nn.Sequential(
            nn.Linear(d_model, 64),
            nn.BatchNorm1d(64),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
            nn.Linear(64, num_classes)
        )

    def forward(self, points, return_logits=False):
        # points: (B, N, 3)
        x = self.embedding(points) + self.pos_encoding(points)  # (B, N, d_model)
        x = self.transformer(x)        # self-attention across points
        x = x.max(dim=1).values        # global max pooling -> (B, d_model)
        x = self.head(x)

        if return_logits:
            return x
        return F.log_softmax(x, dim=1)
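
Since the model returns log-probabilities by default, the sketch below shows one way it could be used with a loss; the pairing with nn.NLLLoss is my assumption based on the log_softmax output, and the dummy tensors are only a shape check.

# Usage sketch: the log-softmax output pairs with NLLLoss, while
# return_logits=True pairs with CrossEntropyLoss (dummy data for a shape check).
import torch
import torch.nn as nn

model_t = cls_model_transformer(num_classes=3)
points = torch.rand(8, 1000, 3)          # dummy batch: (B, N, 3)
labels = torch.randint(0, 3, (8,))       # dummy class labels

log_probs = model_t(points)              # (8, 3) log-probabilities
loss = nn.NLLLoss()(log_probs, labels)

# Equivalent alternative:
# logits = model_t(points, return_logits=True)
# loss = nn.CrossEntropyLoss()(logits, labels)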
    




Test Accuracy: 0.988 (an improvement of 0.8 percentage points, about 0.82% relative, over the PointNet baseline of 0.980)

Model Performance Comparison

Model                 Test Accuracy
PointNet (Baseline)   0.980
Transformer           0.988

Classification Visualizations

The following visualizations show objects that were misclassified by the PointNet model but correctly identified by the transformer model.

Point Cloud Visualization   Ground Truth   PointNet (Baseline) Prediction (Failure)   Transformer Prediction (Success)
[point cloud image]         Chair          Lamp                                       Chair
[point cloud image]         Lamp           Vase                                       Lamp
[point cloud image]         Vase           Lamp                                       Vase





Segmentation Model Architecture

class seg_model_transformer(nn.Module):
    def __init__(self, num_seg_classes=6, d_model=32, nhead=2, num_layers=1):
        super().__init__()

        self.embedding = nn.Linear(3, d_model)
        self.pos_encoding = PositionalEncoding(d_model)

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dropout=0.1, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers)

        self.conv_head = nn.Sequential(
            nn.Conv1d(d_model, 64, 1),
            nn.BatchNorm1d(64),
            nn.ReLU(inplace=True),
            nn.Conv1d(64, num_seg_classes, 1)
        )

    def forward(self, points, return_logits=False):
        # points: (B, N, 3)
        x = self.embedding(points) + self.pos_encoding(points)
        x = self.transformer(x)

        x = x.transpose(1, 2)    # (B, d_model, N) for the Conv1d head
        x = self.conv_head(x)
        x = x.transpose(1, 2)    # (B, N, num_seg_classes)

        if return_logits:
            return x
        return F.log_softmax(x, dim=2)
    




Test Accuracy: 0.923 (an improvement of 1.9 percentage points, about 2.10% relative, over the PointNet baseline of 0.904)

Model Performance Comparison

Model                 Test Accuracy
PointNet (Baseline)   0.904
Transformer           0.923

Segmentation Visualizations

The following visualizations show objects for which the PointNet model achieved lower test accuracy, while the Transformer model performed better.

Ground Truth           PointNet (Baseline) Visualization   Transformer Visualization   PointNet Test Accuracy   Transformer Test Accuracy
[segmentation image]   [segmentation image]                [segmentation image]        0.409                    0.892
[segmentation image]   [segmentation image]                [segmentation image]        0.338                    0.876
[segmentation image]   [segmentation image]                [segmentation image]        0.464                    0.923




Interpretation

  • The Transformer-based implementation achieved slightly higher test accuracy compared to the baseline PointNet model across both tasks. This performance gain highlights the benefit of the self-attention mechanism, which effectively captures local context and relationships between points that the standard PointNet architecture fails to perceive.