Assignment 5 – PointNet on Point Clouds

Q1. Classification Model

1.1 Model Architecture Summary

import torch
import torch.nn as nn

class cls_model(nn.Module):
    def __init__(self, num_classes=3):
        super(cls_model, self).__init__()
        self.point_features = nn.Sequential(
            nn.Conv1d(3, 64, 1),
            nn.ReLU(),
            nn.Conv1d(64, 128, 1),
            nn.ReLU(),
            nn.Conv1d(128, 1024, 1),
        )
        self.classifier = nn.Sequential(
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, points):
        '''
        points: tensor of size (B, N, 3), where B is the batch size
                and N is the number of points per object (N=10000 by default)
        output: tensor of size (B, num_classes)
        '''
        # transpose points to (B, 3, N)
        points_transposed = points.transpose(1, 2)

        # per-point features: (B, 3, N) -> (B, 1024, N)
        x = self.point_features(points_transposed)

        # max pooling to get global features
        x_global = torch.max(x, dim=2)[0] # (B, 1024)

        # classify
        output = self.classifier(x_global)
        return output
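
As a quick sanity check, the model maps a (B, N, 3) cloud to (B, num_classes) logits. A minimal sketch (the batch size here is arbitrary):

import torch

model = cls_model(num_classes=3)
clouds = torch.rand(4, 10000, 3)   # batch of 4 clouds, 10000 points each
logits = model(clouds)             # (4, 3)
preds = logits.argmax(dim=1)       # (4,) predicted class index per cloud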

1.2 Training Setup
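
The recorded hyperparameters are omitted here; the sketch below shows the assumed setup with an Adam optimizer and cross-entropy loss (the learning rate, epoch count, and train_loader are illustrative placeholders, not the values actually used):

import torch
import torch.nn as nn

model = cls_model(num_classes=3)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed optimizer/lr
criterion = nn.CrossEntropyLoss()

for epoch in range(10):                  # epoch count is an assumption
    for points, labels in train_loader:  # hypothetical DataLoader of ((B, N, 3), (B,)) pairs
        optimizer.zero_grad()
        logits = model(points)           # (B, num_classes)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()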

1.3 Test Accuracy

Final Test Accuracy: 0.9717

1.4 Visualizations of Predictions

Correct Predictions

Failure Cases

1.5 Interpretation

The failure cases are likely caused by the global max pooling, which keeps only the most prominent features of each category. The misclassified shapes have features that are ambiguous between two categories: long supporting legs appear on both lamps and chairs, and a container-like body appears on both lamps and vases. With only a globally pooled feature and no local awareness, the model has a hard time distinguishing such borderline shapes.


Q2. Segmentation Model

2.1 Model Architecture Summary

import torch
import torch.nn as nn

class seg_model(nn.Module):
    def __init__(self, num_seg_classes=6):
        super(seg_model, self).__init__()
        
        # 1. Encoder (Same as Classification)
        self.point_features = nn.Sequential(
            nn.Conv1d(3, 64, 1),
            nn.ReLU(),
            nn.Conv1d(64, 128, 1),
            nn.ReLU(),
            nn.Conv1d(128, 1024, 1)
        )
        
        # 2. Decoder (the "concatenation trick")
        # Input channels: 1024 (per-point, local) + 1024 (global) = 2048
        self.decoder = nn.Sequential(
            nn.Conv1d(2048, 512, 1),
            nn.ReLU(),
            nn.Conv1d(512, 256, 1),
            nn.ReLU(),
            nn.Conv1d(256, 128, 1),
            nn.ReLU(),
            nn.Conv1d(128, num_seg_classes, 1)
        )

    def forward(self, points):
        '''
        points: tensor of size (B, N, 3)
        output: tensor of size (B, N, num_seg_classes)
        '''
        num_points = points.size(1)
        
        # Transpose: (B, N, 3) -> (B, 3, N)
        points_transposed = points.transpose(1, 2)
        
        # Encoder: (B, 3, N) -> (B, 1024, N)
        local_features = self.point_features(points_transposed)
        
        # Max Pooling (Global Feature): (B, 1024, N) -> (B, 1024, 1)
        global_feature = torch.max(local_features, dim=2, keepdim=True)[0]
        
        # Expansion: (B, 1024, 1) -> (B, 1024, N)
        global_feature_repeated = global_feature.repeat(1, 1, num_points)
        
        # Concatenation: (B, 1024, N) + (B, 1024, N) -> (B, 2048, N)
        combined_features = torch.cat([local_features, global_feature_repeated], dim=1)
        
        # Decoder: (B, 2048, N) -> (B, num_seg_classes, N)
        logits = self.decoder(combined_features)
        
        # Transpose back: (B, num_seg_classes, N) -> (B, N, num_seg_classes)
        return logits.transpose(1, 2)
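
The same shape check applies here, now with per-point logits (a minimal sketch; sizes are arbitrary):

import torch

model = seg_model(num_seg_classes=6)
clouds = torch.rand(2, 10000, 3)   # 2 clouds, 10000 points each
logits = model(clouds)             # (2, 10000, 6): per-point class logits
preds = logits.argmax(dim=2)       # (2, 10000): predicted part label per point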

2.2 Training Setup
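
As in Q1, the exact configuration is omitted; the sketch below shows the assumed per-point cross-entropy setup. nn.CrossEntropyLoss expects class channels in dimension 1, so the (B, N, C) logits are flattened to (B*N, C) (optimizer, learning rate, and train_loader are again illustrative placeholders):

import torch
import torch.nn as nn

model = seg_model(num_seg_classes=6)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed optimizer/lr
criterion = nn.CrossEntropyLoss()

for points, labels in train_loader:  # hypothetical DataLoader of ((B, N, 3), (B, N)) pairs
    optimizer.zero_grad()
    logits = model(points)           # (B, N, 6)
    # flatten so every point contributes one term to the loss
    loss = criterion(logits.reshape(-1, 6), labels.reshape(-1))
    loss.backward()
    optimizer.step()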

2.3 Test Accuracy

Final Test Accuracy: 0.8995

2.4 Visualizations

For each object: show predicted segmentation + ground truth.

Object 1

Object 2

Object 3

Object 4 — Bad Prediction

Object 5 — Bad Prediction

2.5 Interpretation

The model achieves near-perfect accuracy on standard chair topologies where functional parts—such as legs, seats, and backrests—are geometrically distinct and spatially separated. However, performance degrades significantly on complex shapes like armchairs or continuous curved designs (Object 4 and 5), where the boundaries between semantic parts are ambiguous or merged. In these failure cases, the model struggles to classify adjacent points that share similar local geometry, causing labels to incorrectly "bleed" across transition zones where parts are not clearly separated.


Q3. Robustness Analysis

Experiment 1

Procedure

Down-sampled each point cloud to 10k, 1k, 100, and 50 points, then reran the pretrained classification and segmentation models to quantify sensitivity to sparsity.
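
The down-sampling step can be sketched as follows (assuming uniform random sampling without replacement; points is a hypothetical (B, N, 3) tensor, and the actual sampling scheme is not specified here):

import torch

def downsample(points, num_samples):
    """Randomly keep num_samples points from a (B, N, 3) cloud."""
    idx = torch.randperm(points.size(1))[:num_samples]
    return points[:, idx, :]  # (B, num_samples, 3)

for n in [10000, 1000, 100, 50]:
    sparse = downsample(points, n)
    # rerun the pretrained models on `sparse`
    # (for segmentation, subsample the per-point labels with the same idx)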

Accuracy Comparison

Visualizations

Interpretation

Point density has little impact on classification accuracy: even with as few as 50 points, the overall structure of each object is still visible, so the global feature remains informative. Segmentation is affected more as the clouds become sparser, because individual parts eventually become illegible and specific points grow ambiguous, even to the naked eye, as to which part they belong.


Experiment 2

Procedure

Applied rigid rotations of 0°, 30°, 90°, and 180° around a principal axis before running inference, isolating orientation robustness without retraining.
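
The rotation can be sketched as a rigid transform (assuming rotation about the z-axis; the report does not state which principal axis was used, and points is again a hypothetical (B, N, 3) tensor):

import math
import torch

def rotate_z(points, degrees):
    """Rotate a (B, N, 3) cloud by `degrees` about the z-axis."""
    theta = math.radians(degrees)
    c, s = math.cos(theta), math.sin(theta)
    R = torch.tensor([[c, -s, 0.0],
                      [s,  c, 0.0],
                      [0.0, 0.0, 1.0]])
    return points @ R.T  # apply the same rotation to every point

for angle in [0, 30, 90, 180]:
    rotated = rotate_z(points, angle)
    # run inference on `rotated` without retraining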

Accuracy Comparison

Visualizations

Interpretation

Both tasks suffer large accuracy losses once the input orientation differs from the training orientation, with 90° the worst case, likely because a 90° rotation deviates farthest from the canonical pose (a 180° rotation can partially re-align symmetric structures). This indicates that the models rely on canonical alignment and would need rotation augmentation or an equivariant architecture to generalize across orientations.