Assignment 5: Point Cloud Processing
Name: Simson D'Souza, Andrew ID: sjdsouza, Email: sjdsouza@andrew.cmu.edu
1. Classification Model (40 points)
Model Architecture
```python
class cls_model(nn.Module):
    def __init__(self, num_classes=3):
        super(cls_model, self).__init__()
        self.conv1 = nn.Conv1d(3, 64, 1)
        self.bn1 = nn.BatchNorm1d(64)
        self.conv2 = nn.Conv1d(64, 64, 1)
        self.bn2 = nn.BatchNorm1d(64)
        self.conv3 = nn.Conv1d(64, 128, 1)
        self.bn3 = nn.BatchNorm1d(128)
        self.conv4 = nn.Conv1d(128, 1024, 1)
        self.bn4 = nn.BatchNorm1d(1024)
        self.fc1 = nn.Linear(1024, 512)
        self.bn_fc1 = nn.BatchNorm1d(512)
        self.dropout1 = nn.Dropout(p=0.4)
        self.fc2 = nn.Linear(512, 256)
        self.bn_fc2 = nn.BatchNorm1d(256)
        self.dropout2 = nn.Dropout(p=0.4)
        self.fc3 = nn.Linear(256, num_classes)

    def forward(self, points):
        '''
        points: tensor of size (B, N, 3),
                where B is batch size and N is the number of points (default N=10000)
        output: tensor of size (B, num_classes)
        '''
        B, N, D = points.size()
        x = points.transpose(2, 1)
        x = F.relu(self.bn1(self.conv1(x)))    # (B, 64, N)
        x = F.relu(self.bn2(self.conv2(x)))    # (B, 64, N)
        x = F.relu(self.bn3(self.conv3(x)))    # (B, 128, N)
        x = F.relu(self.bn4(self.conv4(x)))    # (B, 1024, N)
        x = torch.max(x, 2, keepdim=False)[0]  # global feature (B, 1024)
        x = F.relu(self.bn_fc1(self.fc1(x)))
        x = self.dropout1(x)
        x = F.relu(self.bn_fc2(self.fc2(x)))
        x = self.dropout2(x)
        output = self.fc3(x)
        return output
```
Model Training Hyperparameters
- Optimizer: Adam
- Learning Rate: 0.001
- Epochs: 250
- Batch Size: 32
Test Accuracy: 0.9800629590766002
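For concreteness, a minimal sketch of how these hyperparameters could be wired into a training loop; `train_dataloader` and the label format are assumptions for illustration, not the actual starter code.

```python
import torch
import torch.nn as nn

# Hypothetical training setup matching the hyperparameters listed above.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = cls_model(num_classes=3).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()  # cls_model returns raw logits

for epoch in range(250):
    for points, labels in train_dataloader:  # assumed: points (B, N, 3), labels (B,)
        points, labels = points.to(device), labels.to(device)
        optimizer.zero_grad()
        logits = model(points)               # (B, num_classes)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
```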
Successful Predictions
| Point Cloud Visualization | Ground Truth | Prediction |
|---|---|---|
|  | Chair | Chair |
|  | Lamp | Lamp |
|  | Vase | Vase |
Failure Predictions
| Point Cloud Visualization | Ground Truth | Prediction |
|---|---|---|
|  | Chair | Lamp |
|  | Lamp | Vase |
|  | Vase | Lamp |
Interpretation
- The PointNet model demonstrates high accuracy for objects exhibiting canonical shapes. In these successful cases, the Max Pooling layer effectively extracts a distinctive global feature vector that strongly matches the target class (a quick sanity check of this pooling step's permutation invariance is sketched after this list).
- Failures often occur when an object possesses features commonly found in another class. For instance, the Chair (bar stool) was misclassified as a Lamp due to its simplified geometry, emphasizing verticality and cylindrical symmetry, which aligns with the features of many lamps.
- The model shows sensitivity to global symmetry. Confusion between Lamp and Vase occurred when one object exhibited the primary geometric feature of the other (e.g., a bulky, cylindrical lamp base resembling a wide vase).
- The introduction of complex, external features or "noise" (like the plant structure within the Vase point cloud) disrupted the overall global feature extraction. This resulted in a failure, as the added detail skewed the global descriptor, leading to misclassification as a more intricate object like a Lamp.
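Since the interpretation above leans on the max-pooled global feature, here is a small sanity check showing that the pooled descriptor, and hence the prediction, is unaffected by point ordering. It assumes a trained `model` (the cls_model above in eval mode) and a `(1, N, 3)` tensor `points`; these names are placeholders, not from the starter code.

```python
import torch

# Shuffling the point order should leave the logits unchanged, because the
# per-point layers are applied independently and torch.max over the point
# dimension discards ordering entirely.
model.eval()
with torch.no_grad():
    perm = torch.randperm(points.size(1))
    logits_original = model(points)
    logits_shuffled = model(points[:, perm, :])
print(torch.allclose(logits_original, logits_shuffled, atol=1e-5))  # expected: True
```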
2. Segmentation Model (40 points)
Model Architecture
```python
class seg_model(nn.Module):
    def __init__(self, num_seg_classes=6):
        super(seg_model, self).__init__()
        self.conv1 = nn.Conv1d(3, 64, 1)
        self.bn1 = nn.BatchNorm1d(64)
        self.conv2 = nn.Conv1d(64, 64, 1)
        self.bn2 = nn.BatchNorm1d(64)
        self.conv3 = nn.Conv1d(64, 128, 1)
        self.bn3 = nn.BatchNorm1d(128)
        self.conv4 = nn.Conv1d(128, 1024, 1)
        self.bn4 = nn.BatchNorm1d(1024)
        self.seg_conv1 = nn.Conv1d(1024 + 128, 512, 1)
        self.seg_bn1 = nn.BatchNorm1d(512)
        self.seg_conv2 = nn.Conv1d(512, 256, 1)
        self.seg_bn2 = nn.BatchNorm1d(256)
        self.seg_conv3 = nn.Conv1d(256, 128, 1)
        self.seg_bn3 = nn.BatchNorm1d(128)
        self.seg_conv4 = nn.Conv1d(128, num_seg_classes, 1)

    def forward(self, points):
        '''
        points: tensor of size (B, N, 3),
                where B is batch size and N is the number of points per object (N=10000 by default)
        output: tensor of size (B, N, num_seg_classes)
        '''
        B, N, D = points.size()
        x = points.transpose(2, 1)
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        local_features = F.relu(self.bn3(self.conv3(x)))           # (B, 128, N)
        x = F.relu(self.bn4(self.conv4(local_features)))           # (B, 1024, N)
        global_feature = torch.max(x, 2, keepdim=True)[0]          # (B, 1024, 1)
        global_feature_expanded = global_feature.repeat(1, 1, N)   # (B, 1024, N)
        concat_features = torch.cat([global_feature_expanded, local_features], 1)  # (B, 1152, N)
        # Segmentation head (per-point prediction)
        x = F.relu(self.seg_bn1(self.seg_conv1(concat_features)))
        x = F.relu(self.seg_bn2(self.seg_conv2(x)))
        x = F.relu(self.seg_bn3(self.seg_conv3(x)))
        output = self.seg_conv4(x)        # (B, num_seg_classes, N)
        output = output.transpose(2, 1)   # (B, N, num_seg_classes)
        return output
```
Model Training Hyperparameters
- Optimizer: Adam
- Learning Rate: 0.001
- Epochs: 250
- Batch Size: 32
Test Accuracy: 0.9044588330632091
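The reported number is per-point accuracy. A minimal sketch of how it can be computed, assuming `model` is the trained seg_model, `points` is a `(B, N, 3)` test batch, and `labels` holds per-point ground truth of shape `(B, N)` (variable names are placeholders):

```python
import torch

# Per-point segmentation accuracy: argmax over the class dimension, then
# average the fraction of points whose predicted label matches ground truth.
model.eval()
with torch.no_grad():
    logits = model(points)               # (B, N, num_seg_classes)
    pred_labels = logits.argmax(dim=-1)  # (B, N)
    accuracy = (pred_labels == labels).float().mean().item()
print(f"per-point accuracy: {accuracy:.4f}")
```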
Segmentation Results of 6 Objects
Successful Predictions with better accuracy
| Ground Truth | Prediction | Test Accuracy |
|---|---|---|
| ![]() | ![]() | 0.946 |
| ![]() | ![]() | 0.987 |
| ![]() | ![]() | 0.955 |
Failure Predictions with lower accuracy
| Ground Truth | Prediction | Test Accuracy |
|---|---|---|
| ![]() | ![]() | 0.531 |
| ![]() | ![]() | 0.594 |
| ![]() | ![]() | 0.548 |
Interpretation
- The model performs exceptionally well on chairs with clear geometric separation between parts (back, seat, legs), as seen in the examples above with accuracies of 0.946, 0.987, and 0.955. The key to this success is that PointNet uses the global feature (context of the whole chair) to reinforce the local features of each point, allowing it to accurately delineate edges, such as the thin boundary between the red seat and the blue legs/frame.
- The most common failure mode is poor definition in objects where components are merged (like soft seating) or have rounded edges. The lack of sharp geometric distinction makes it nearly impossible for the per-point MLPs to accurately assign semantic boundaries, leading to significant bleeding between segments.
- In the third failure (Accuracy: 0.548), the model appears to fundamentally misclassify the base structure. The Ground Truth shows clear segment layers (e.g., base/seat), but the prediction replaces a large vertical section with a single color (Blue). This suggests the model failed to combine local features with the global context correctly, likely interpreting the lower portion as a single, uniform volume rather than distinct chair segments.
- The low accuracy confirms that when local feature clues are ambiguous due to complex geometry (like thick cushions, round surfaces), the network's reliance on the global feature vector is insufficient to distinguish boundaries at the point level, resulting in volumetric misassignments.
3. Robustness Analysis (20 points)
Experiment 1: Rotate input point clouds by certain degrees
Procedure: To evaluate the model's invariance to orientation, I tested the pre-trained models on the standard test dataset while applying a rotation transformation to the input point clouds. I systematically rotated each object around the vertical axis (z-axis) with angles ranging from $10^{\circ}$ to $180^{\circ}$ and recorded the drop in accuracy for both classification and segmentation tasks.
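A minimal sketch of the rotation transform, assuming the clouds are `(B, N, 3)` tensors with z as the vertical axis; the helper name is illustrative, not taken from the evaluation script.

```python
import math
import torch

def rotate_z(points: torch.Tensor, degrees: float) -> torch.Tensor:
    """Rotate a (B, N, 3) point cloud about the z-axis by the given angle."""
    theta = math.radians(degrees)
    c, s = math.cos(theta), math.sin(theta)
    rot = torch.tensor([[c, -s, 0.0],
                        [s,  c, 0.0],
                        [0.0, 0.0, 1.0]], dtype=points.dtype, device=points.device)
    return points @ rot.T  # applies the same rotation to every point

# Example: evaluate the pre-trained models on clouds rotated by 40 degrees.
rotated = rotate_z(points, 40.0)  # `points` is an assumed (B, N, 3) test batch
```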
Classification Results
The table shows the accuracy comparison relative to the baseline (Q1).
| Rotation (Degree) | 0 (Baseline) | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 | 110 | 120 | 130 | 140 | 150 | 160 | 170 | 180 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Test Accuracy | 0.980 (Baseline) | 0.969 | 0.912 | 0.823 | 0.686 | 0.475 | 0.285 | 0.288 | 0.376 | 0.459 | 0.493 | 0.492 | 0.487 | 0.478 | 0.512 | 0.601 | 0.673 | 0.696 | 0.642 |
The following are a few visualizations (including both successful and failed predictions):
| Rotation (Degree) | Ground Truth | Prediction | Point Cloud Visualization |
|---|---|---|---|
| 10 | Chair | Chair | ![]() |
| 40 | Chair | Vase | ![]() |
| 70 | Lamp | Chair | ![]() |
Segmentation Results
The table shows the accuracy comparison relative to the baseline (Q2).
| Rotation (Degree) | 0 (Baseline) | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 | 110 | 120 | 130 | 140 | 150 | 160 | 170 | 180 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Test Accuracy | 0.904 (Baseline) | 0.887 | 0.846 | 0.794 | 0.735 | 0.663 | 0.569 | 0.447 | 0.322 | 0.227 | 0.167 | 0.133 | 0.124 | 0.137 | 0.169 | 0.224 | 0.292 | 0.344 | 0.359 |
The following are a few visualizations (failure cases):
| Rotation (Degree) | Ground Truth | Prediction | Test Accuracy |
|---|---|---|---|
| 30 | ![]() | ![]() | 0.361 |
| 100 | ![]() | ![]() | 0.167 |
| 180 | ![]() | ![]() | 0.359 |
Interpretation
- Classification: The model lacks rotation invariance and is highly sensitive to the orientation of the input data. As shown in the results, accuracy is high at small perturbations ($10^{\circ}$) but plummets drastically as rotation approaches $60^{\circ}-90^{\circ}$ (dropping from $\sim97\%$ to $\sim28\%$). This indicates that the PointNet architecture relies heavily on the canonical alignment of the training set and fails to generalize when absolute coordinates shift.
- Segmentation: Similar to classification, segmentation performance degrades rapidly as rotation increases, dropping from $\sim88\%$ accuracy at $10^{\circ}$ to $\sim12\%$ at $120^{\circ}$. Because the model learns to associate specific parts with specific spatial locations, rotating the object disrupts these learned spatial priors.
Experiment 2: Different number of points per object
Procedure: To analyze how dependent the model is on high-resolution data, I varied the number of points sampled from the test objects during inference. While the model was trained on 10,000 points, I evaluated it on inputs ranging from extremely sparse clouds (10 points) to dense clouds (9,000 points) to identify the critical sampling density required for accurate performance.
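A minimal sketch of the subsampling step, assuming each test cloud is a `(B, 10000, 3)` tensor; the helper is illustrative rather than the exact evaluation code.

```python
import torch

def subsample(points: torch.Tensor, num_points: int) -> torch.Tensor:
    """Randomly keep `num_points` of the N points in each (B, N, 3) cloud."""
    idx = torch.randperm(points.size(1))[:num_points]
    return points[:, idx, :]

# Example: evaluate on sparse clouds of 100 points per object.
sparse = subsample(points, 100)  # `points` is an assumed (B, 10000, 3) test batch
```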
Classification Results
The table shows the accuracy comparison relative to the baseline (Q1).
| Number of Points per object | 10 | 100 | 1000 | 2500 | 5000 | 7500 | 9000 | 10000 (Baseline) |
|---|---|---|---|---|---|---|---|---|
| Test Accuracy | 0.257 | 0.927 | 0.972 | 0.979 | 0.979 | 0.979 | 0.980 | 0.980 (Baseline) |
The following are a few visualizations (failure cases):
| Number of Points per Object | Ground Truth | Prediction | Point Cloud Visualization |
|---|---|---|---|
| 10 | Chair | Lamp | ![]() |
| 1000 | Chair | Vase | ![]() |
| 7500 | Lamp | Chair | ![]() |
Segmentation Results
The table shows the accuracy comparison relative to the baseline (Q2).
| Number of Points per object | 10 | 100 | 1000 | 2500 | 5000 | 7500 | 9000 | 10000 (Baseline) |
|---|---|---|---|---|---|---|---|---|
| Test Accuracy | 0.614 | 0.829 | 0.904 | 0.905 | 0.905 | 0.905 | 0.906 | 0.906 (Baseline) |
The following are a few visualizations (including failure cases):
| Number of Points per Object | Ground Truth | Prediction | Test Accuracy |
|---|---|---|---|
| 10 | ![]() | ![]() | 0.3 |
| 2500 | ![]() | ![]() | 0.464 |
| 7500 | ![]() | ![]() | 0.478 |
Interpretation
- Classification: The model demonstrates remarkable robustness to varying point densities. Accuracy saturates early, around 1,000 points ($\sim97\%$), implying that the PointNet architecture effectively captures the global shape using only a sparse "critical set" of points. Performance only collapses at extremely low densities (10 points, $\sim25\%$), where the point cloud becomes too sparse to represent the object's underlying geometry, rendering it unrecognizable.
- Segmentation: Similar to classification, segmentation performance remains stable and high (>90%) once the point count exceeds 1,000, with negligible gains from adding more points (up to 9,000). The model remains surprisingly effective even at 100 points (83%), suggesting it relies on coarse structural cues rather than fine-grained density. However, at 10 points, accuracy drops significantly because there is insufficient local context to distinguish between different object parts.
4. Bonus Question - Locality (20 points)
Model Implemented
- I have implemented a Transformer-based Point Encoder for both classification and segmentation.
- Classification Model: Input points are projected to embeddings with added sinusoidal positional encodings. The core architecture uses a Transformer Encoder layer with Multi-Head Self-Attention to capture relationships between points. Features are aggregated via Max Pooling into a global vector and passed through an MLP head for final classification.
- Segmentation Model: Utilizes the same embedding and Transformer Encoder backbone to enrich point features with context via self-attention. Unlike the classification model, it skips global pooling and applies a 1D convolutional head directly to the sequence of encoded features to generate per-point segmentation masks.
- This satisfies the locality requirement by using Self-Attention, which allows the model to aggregate information from neighboring points dynamically, rather than processing each point in isolation like PointNet.
Both models are trained with the same hyperparameters as the baselines:
- Optimizer: Adam
- Learning Rate: 0.001
- Epochs: 250
- Batch Size: 32
Classification Model Architecture
```python
class PositionalEncoding(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.dim = dim

    def forward(self, x):
        device = x.device
        N = x.size(1)
        pos = torch.arange(N, dtype=torch.float32, device=device).unsqueeze(1)
        dim = torch.arange(self.dim, device=device).float()
        angle = pos / (10000 ** (2 * (dim // 2) / self.dim))
        pe = torch.zeros(N, self.dim, device=device)
        pe[:, 0::2] = torch.sin(angle[:, 0::2])
        pe[:, 1::2] = torch.cos(angle[:, 1::2])
        return pe.unsqueeze(0)


class cls_model_transformer(nn.Module):
    def __init__(self, num_classes=3, d_model=32, nhead=2, num_layers=1):
        super().__init__()
        self.embedding = nn.Linear(3, d_model)
        self.pos_encoding = PositionalEncoding(d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dropout=0.1, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers)
        self.head = nn.Sequential(
            nn.Linear(d_model, 64),
            nn.BatchNorm1d(64),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
            nn.Linear(64, num_classes)
        )

    def forward(self, points, return_logits=False):
        x = self.embedding(points) + self.pos_encoding(points)
        x = self.transformer(x)
        x = x.max(dim=1).values
        x = self.head(x)
        if return_logits:
            return x
        return F.log_softmax(x, dim=1)
```
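As a quick sanity check of the tensor flow (not part of training), the model can be run on a small random batch; the point count here is reduced from the dataset's 10,000 only to keep the self-attention matrix small in this sketch.

```python
import torch

# Dummy forward pass: 2 random clouds of 1024 points each.
model = cls_model_transformer(num_classes=3, d_model=32, nhead=2, num_layers=1)
model.eval()
with torch.no_grad():
    dummy = torch.rand(2, 1024, 3)
    out = model(dummy)   # log-probabilities over the 3 classes
print(out.shape)         # torch.Size([2, 3])
```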
Test Accuracy: 0.988 (an improvement of 0.82% in test accuracy over the PointNet baseline)
Model Performance Comparison
| Model | Test Accuracy |
|---|---|
| PointNet (Baseline) | 0.980 |
| Transformer | 0.988 |
Classification Visualizations
The following visualizations show objects that were misclassified by the PointNet model but correctly identified by the transformer model.
| Point Cloud Visualization | Ground Truth | PointNet (Baseline) Model Prediction (Failure) | Transformer Model Prediction (Success) |
|---|---|---|---|
| ![]() | Chair | Lamp | Chair |
| ![]() | Lamp | Vase | Lamp |
| ![]() | Vase | Lamp | Vase |
Segmentation Model Architecture
```python
class seg_model_transformer(nn.Module):
    def __init__(self, num_seg_classes=6, d_model=32, nhead=2, num_layers=1):
        super().__init__()
        self.embedding = nn.Linear(3, d_model)
        self.pos_encoding = PositionalEncoding(d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dropout=0.1, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers)
        self.conv_head = nn.Sequential(
            nn.Conv1d(d_model, 64, 1),
            nn.BatchNorm1d(64),
            nn.ReLU(inplace=True),
            nn.Conv1d(64, num_seg_classes, 1)
        )

    def forward(self, points, return_logits=False):
        # points: (B, N, 3)
        x = self.embedding(points) + self.pos_encoding(points)
        x = self.transformer(x)
        x = x.transpose(1, 2)
        x = self.conv_head(x)
        x = x.transpose(1, 2)
        if return_logits:
            return x
        return F.log_softmax(x, dim=2)
```
Test Accuracy: 0.923 (an improvement of 2.10% in test accuracy over the PointNet baseline)
Model Performance Comparison
| Model | Test Accuracy |
|---|---|
| PointNet (Baseline) | 0.904 |
| Transformer | 0.923 |
Segmentation Visualizations
The following visualizations show objects for which the PointNet model achieved lower test accuracy, while the Transformer model performed better.
| Ground Truth | PointNet (Baseline) Model Visualization | Transformer Model Visualization | PointNet (Baseline) Model Test Accuracy | Transformer Model Test Accuracy |
|---|---|---|---|---|
| ![]() | ![]() | ![]() | 0.409 | 0.892 |
| ![]() | ![]() | ![]() | 0.338 | 0.876 |
| ![]() | ![]() | ![]() | 0.464 | 0.923 |
Interpretation
- The Transformer-based implementation achieved slightly higher test accuracy than the baseline PointNet model on both tasks. This gain highlights the benefit of the self-attention mechanism, which captures local context and relationships between points that the standard PointNet architecture, processing each point independently before global pooling, does not model.