Assignment 2 - Single view to 3D

Nanaki Singh

Q1.1 - Fitting a voxel grid

Left: fitted prediction. Right: ground-truth target.
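The fitting code is not shown in this notebook; below is a minimal sketch of the direct-optimization loop I would expect here, assuming the source grid is a raw logit tensor fit to the target occupancies with binary cross-entropy (the tensor names, grid size, and step count are illustrative). Q1.2 and Q1.3 follow the same loop, swapping in the Chamfer loss over point coordinates and Chamfer-plus-smoothness losses over vertex offsets, respectively.

In [ ]:
import torch
import torch.nn.functional as F

# Illustrative target: a binary 32^3 occupancy grid
voxels_tgt = (torch.rand(1, 32, 32, 32) > 0.5).float()

# Source grid holds raw logits and is optimized directly
voxels_src = torch.zeros(1, 32, 32, 32, requires_grad=True)
optimizer = torch.optim.Adam([voxels_src], lr=1e-2)

for _ in range(1000):
    optimizer.zero_grad()
    loss = F.binary_cross_entropy_with_logits(voxels_src, voxels_tgt)
    loss.backward()
    optimizer.step()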

Q1.2 - Fitting a point cloud

Left: fitted prediction. Right: ground-truth target.

Q1.3 - Fitting a mesh

Left: fitted prediction. Right: ground-truth target.

Q2.1 - Image to point cloud

Below are visualizations for three different datapoints.

The point cloud model was trained for 2000 epochs on a CPU, with the learned features as input, using the Chamfer distance loss.

The model structure is shown below.

[Visualizations for three datapoints]

In [ ]:
self.decoder = nn.Sequential(
    nn.Linear(512, 1024),
    nn.ReLU(),
    nn.Linear(1024, 2048),
    nn.ReLU(),
    nn.Linear(2048, 4096),
    nn.ReLU(),
    # Predict 3 coordinates per point
    nn.Linear(4096, self.n_point * 3),
    # Tanh keeps coordinates in [-1, 1]
    nn.Tanh(),
    # Reshape flat output to B x n_point x 3
    nn.Unflatten(1, (self.n_point, 3)),
)
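As a quick sanity check on the output shape, the decoder can be wrapped in a minimal module and run on a dummy feature batch. This is an illustrative sketch: the PointDecoder wrapper and the n_point value of 5000 are assumptions, not the notebook's actual class.

In [ ]:
import torch
import torch.nn as nn

class PointDecoder(nn.Module):
    # Illustrative wrapper around the decoder defined above
    def __init__(self, n_point=5000):
        super().__init__()
        self.n_point = n_point
        self.decoder = nn.Sequential(
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, 2048), nn.ReLU(),
            nn.Linear(2048, 4096), nn.ReLU(),
            nn.Linear(4096, self.n_point * 3), nn.Tanh(),
            nn.Unflatten(1, (self.n_point, 3)),
        )

    def forward(self, feat):
        return self.decoder(feat)

model = PointDecoder(n_point=5000)
points = model(torch.randn(4, 512))  # batch of 4 image features
assert points.shape == (4, 5000, 3)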

Q2.2 - Image to mesh

Below are visualizations for three different datapoints.

The mesh model was trained for 10000 epochs on a CPU, with the learned features as input, using the Laplacian smoothing loss.

The model structure is shown below.

[Visualizations for three datapoints]

In [ ]:
# Number of vertices in the template mesh
vertex_shape_0 = self.mesh_pred.verts_list()[0].shape[0]
# Decoder predicts 3D offsets for each vertex
self.decoder = nn.Sequential(
    nn.Linear(512, 1024),
    nn.BatchNorm1d(1024),
    nn.ReLU(inplace=True),
    nn.Linear(1024, 2048),
    nn.BatchNorm1d(2048),
    nn.ReLU(inplace=True),
    nn.Linear(2048, 4096),
    nn.BatchNorm1d(4096),
    nn.ReLU(inplace=True),
    nn.Linear(4096, 8192),
    nn.BatchNorm1d(8192),
    nn.ReLU(inplace=True),
    nn.Linear(8192, vertex_shape_0 * 3),
    )
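The decoder above outputs a flat vector of per-vertex offsets; applying them to the template mesh would look roughly like the sketch below. It assumes PyTorch3D's Meshes API, with an ico-sphere standing in for the template and random offsets standing in for the decoder output:

In [ ]:
import torch
from pytorch3d.utils import ico_sphere

# Template mesh with V vertices (ico-sphere used here for illustration)
mesh_pred = ico_sphere(4)
V = mesh_pred.verts_list()[0].shape[0]

# Stand-in for self.decoder(feat): B x (V * 3) predicted offsets
deform = torch.randn(1, V * 3) * 0.01

# offset_verts expects packed (V, 3) offsets; it shifts each vertex
mesh_out = mesh_pred.offset_verts(deform.reshape(-1, 3))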

Q2.3 - Image to voxel

Below are visualizations for three different datapoints.

The voxel model was trained for 20000 epochs on a CPU, with the learned features as input, using the voxel loss.

The model structure is shown below.

[Visualizations for three datapoints]

In [ ]:
self.decoder = nn.Sequential(
    nn.Linear(512, 2 * 2 * 2 * 64),  # Project feature to the seed volume
    nn.ReLU(),
    nn.Unflatten(1, (64, 2, 2, 2)),  # B x 64 x 2 x 2 x 2
    nn.ConvTranspose3d(
        in_channels=64, out_channels=32, kernel_size=4, stride=2, padding=1
    ),  # B x 32 x 4 x 4 x 4
    nn.BatchNorm3d(32),
    nn.ReLU(),
    nn.ConvTranspose3d(
        in_channels=32, out_channels=16, kernel_size=4, stride=2, padding=1
    ),  # B x 16 x 8 x 8 x 8
    nn.BatchNorm3d(16),
    nn.ReLU(),
    nn.ConvTranspose3d(
        in_channels=16, out_channels=8, kernel_size=4, stride=2, padding=1
    ),  # B x 8 x 16 x 16 x 16
    nn.BatchNorm3d(8),
    nn.ReLU(),
    nn.ConvTranspose3d(
        in_channels=8, out_channels=1, kernel_size=4, stride=2, padding=1
    ),  # B x 1 x 32 x 32 x 32
    nn.BatchNorm3d(1),
    nn.Sigmoid(),  # Occupancy probability per voxel
)
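Since the decoder ends in a sigmoid that outputs per-voxel occupancy probabilities, the voxel loss pairs naturally with binary cross-entropy against the binary ground-truth grid. A minimal sketch, assuming that is the loss used here (tensor names are illustrative):

In [ ]:
import torch
import torch.nn.functional as F

def voxel_loss(voxel_src, voxel_tgt):
    # voxel_src: B x 1 x 32 x 32 x 32 sigmoid probabilities from the decoder
    # voxel_tgt: binary ground-truth occupancies of the same shape
    return F.binary_cross_entropy(voxel_src, voxel_tgt)

pred = torch.rand(2, 1, 32, 32, 32)
tgt = (torch.rand(2, 1, 32, 32, 32) > 0.5).float()
print(voxel_loss(pred, tgt))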

Q2.4 - Quantitative comparisons

Quantitatively: The F1 score is highest for the voxel grid, reaching 73 at threshold = 0.05, while the point cloud model scores 71 and the mesh model 69 at the same threshold. The same trend holds across all three graphs: as the threshold increases, the F1 score steadily increases. Visually, the increase is roughly linear for the point cloud and mesh models, while the voxel curve dips at a threshold of 0.3.

Conclusions: Since the voxel-based model performs best, we can see that despite being a relatively coarse representation, voxel grids capture occupancy in a 3D grid structure reasonably well. The point cloud model achieves a similar level of accuracy, suggesting that such models can recreate surface-level geometry well, but they lose fine-grained structure and detail as the threshold gets tighter. Mesh models, which learn to deform a template mesh by predicting per-vertex offsets, perform the worst of the three. This indicates that minor geometric errors in the vertex predictions strongly distort the visualized surface topology, and therefore hurt the F1 score of the predicted mesh the most.
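For reference, the F1 score at a given threshold can be computed from nearest-neighbour distances between sampled predicted and ground-truth surface points. Below is a minimal sketch using torch.cdist; the function name and tensor shapes are illustrative rather than the evaluation script's exact implementation:

In [ ]:
import torch

def f1_at_threshold(points_pred, points_gt, threshold):
    # points_pred: N x 3, points_gt: M x 3 sampled surface points
    dists = torch.cdist(points_pred, points_gt)  # N x M pairwise distances
    # Precision: fraction of predicted points near the ground truth
    precision = (dists.min(dim=1).values < threshold).float().mean()
    # Recall: fraction of ground-truth points near the prediction
    recall = (dists.min(dim=0).values < threshold).float().mean()
    return 100 * 2 * precision * recall / (precision + recall + 1e-8)

pred = torch.rand(1000, 3)
gt = torch.rand(1000, 3)
print(f1_at_threshold(pred, gt, threshold=0.05))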

Below are the graphs of threshold vs. F1 score for the voxel, point cloud, and mesh models.

Left to right: point cloud, voxel, mesh.

Q2.5 - Analyze effects of hyperparam variations

The hyperparameter I chose to vary was w_chamfer, the weight given to the Chamfer loss when training a point cloud model for 200 iterations. I used a weighted average of the Chamfer and Hausdorff losses to penalize the model. Note that the Chamfer loss measures nearest-neighbour distances between points and focuses on recreating the overall shape accurately. In comparison, the Hausdorff loss penalizes the maximum deviation of any point from the target point cloud, which should, in theory, make the predicted point cloud more sensitive to outliers (no stray points predicted far from the object's surface). However, because less weight is applied to the Chamfer distance, some degradation of the overall shape is expected.

I began by writing the code for the Hausdorff loss.

In [ ]:
import torch

def chamfer_loss_helper(point_cloud_src, point_cloud_tgt):
    # Broadcast to compute all pairwise differences
    extracted_point_cloud_src = point_cloud_src[:, :, None, :]  # B x N x 1 x 3
    extracted_point_cloud_tgt = point_cloud_tgt[:, None, :, :]  # B x 1 x N x 3

    # Squared differences
    diff = extracted_point_cloud_src - extracted_point_cloud_tgt  # B x N x N x 3
    diff = diff**2  # B x N x N x 3

    # Sum over the coordinate axis to get squared distances
    diff = diff.sum(dim=-1)  # B x N x N

    # Nearest-neighbour squared distance in each direction
    min_ssd_1 = torch.min(diff, dim=2)  # values: B x N (src -> tgt)
    min_ssd_0 = torch.min(diff, dim=1)  # values: B x N (tgt -> src)

    # Sum over points in each direction
    summed_squared_diff1 = torch.sum(min_ssd_1.values, dim=1)  # (B,)
    summed_squared_diff0 = torch.sum(min_ssd_0.values, dim=1)  # (B,)

    chamfer_distance = summed_squared_diff1 + summed_squared_diff0  # (B,)
    chamfer_distance = chamfer_distance.sum(dim=-1)  # scalar, summed over the batch

    return chamfer_distance

def hausdorff_loss_(point_cloud_src, point_cloud_tgt):
    # Pairwise squared distances
    diff = (
        point_cloud_src[:, :, None, :] - point_cloud_tgt[:, None, :, :]
    )  # B x N x N x 3
    diff = (diff**2).sum(dim=-1)  # B x N x N

    # Direction src -> tgt: worst-case nearest-neighbour distance
    min_values_one, _ = diff.min(dim=2)  # B x N
    max_min_val_one = min_values_one.max(dim=1).values  # (B,)

    # Direction tgt -> src
    min_values_two, _ = diff.min(dim=1)  # B x N
    max_min_val_two = min_values_two.max(dim=1).values  # (B,)

    # Take the max over both directions, then average over the batch
    hausdorff_loss = torch.maximum(max_min_val_one, max_min_val_two)  # (B,)
    hausdorff_loss_mean = hausdorff_loss.mean()

    return hausdorff_loss_mean

def chamfer_loss(point_cloud_src, point_cloud_tgt, w_chamfer=0.5):
    # Weighted average of the Chamfer and Hausdorff losses;
    # w_chamfer is the hyperparameter varied in this experiment
    chamfer = chamfer_loss_helper(point_cloud_src, point_cloud_tgt)
    hausdorff = hausdorff_loss_(point_cloud_src, point_cloud_tgt)

    return w_chamfer * chamfer + (1 - w_chamfer) * hausdorff
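A quick check of the combined loss on dummy point clouds (illustrative shapes):

In [ ]:
src = torch.rand(2, 1000, 3)
tgt = torch.rand(2, 1000, 3)
print(chamfer_loss(src, tgt, w_chamfer=0.9))  # scalar combined loss

Note that the summed Chamfer term and the max-based Hausdorff term live on very different scales, so the effective trade-off is not literally w_chamfer : (1 - w_chamfer); normalizing the Chamfer term by the number of points would make the weighting more interpretable.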

Below are the plots generated by varying the w_chamfer parameter over the values [1.0, 0.9, 0.7, 0.5].

Four point cloud models were trained for 200 epochs each to generate the following plots. Visually, the differences are very minor, but the average F1 score first increased and then decreased as w_chamfer increased:

w_chamfer = 1.0: average F1 score = 0.84
w_chamfer = 0.9: average F1 score = 0.87
w_chamfer = 0.7: average F1 score = 0.79
w_chamfer = 0.5: average F1 score = 0.74

I noticed that as w_chamfer increased, there were fewer outlier points (points were predicted closer to the true surface of the chair).

Left to right: w_chamfer = 1.0, 0.9, 0.7, 0.5.

Q2.6 - Interpret Model

I was interested in understanding what structural information the feature vector encoded and how the model interpreted it. I experimented with interpolating between two latent feature vectors — one representing a flatter, more compact chair, and the other a taller, elongated one.

The original point clouds are shown below:

Left: flatter, more compact chair. Right: taller, elongated chair.

I created a new latent vector by linearly blending the two with a weight parameter w, varied from 0.0 to 1.0. Each interpolated vector was then passed through the trained point cloud decoder to visualize the output, as sketched below.
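A minimal sketch of this interpolation step, where feat_a and feat_b stand in for the two encoded 512-dim features and a toy decoder stands in for the trained one:

In [ ]:
import torch
import torch.nn as nn

# Stand-ins: the two encoded 512-dim chair features and the trained decoder
feat_a = torch.randn(1, 512)  # taller, elongated chair
feat_b = torch.randn(1, 512)  # flatter, compact chair
decoder = nn.Sequential(
    nn.Linear(512, 1000 * 3), nn.Tanh(), nn.Unflatten(1, (1000, 3))
)

for w in [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]:
    z = (1 - w) * feat_a + w * feat_b  # linear blend of the latent vectors
    points = decoder(z)                # 1 x 1000 x 3 interpolated point cloud
    # render(points) would produce the images shown below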

Left to right: w = 0.0, 0.2, 0.4, 0.6, 0.8, 1.0.

As can be seen from the images, increasing w produced a smooth transition between the two shapes: the reconstruction gradually shifted from a taller, thinner chair to a wider one, before settling into a flatter distribution of points. This clearly shows that the model's latent space can capture and decode structural information about points. The latent vector also seems to primarily store the generic shape of the chair rather than fine-grained details; otherwise we would more likely have seen artifacts or stray points appear in the intermediate reconstructions.

Note that the thinner, more structured chair's shape dominated for w = 0.0, 0.2, 0.4, and 0.6, and was only completely lost at w = 1.0.