| Input image | Predicted voxel | Target mesh |
|---|---|---|
| Input image | Predicted point cloud | Target mesh |
|---|---|---|
| Input image | Predicted mesh | Target mesh |
|---|---|---|
The three models achieved similar performance scores, but they required very different amounts of computation to reach those results. The voxel model was trained for 15K iterations, the point cloud model for 2K iterations, and the mesh model for 1.5K iterations. I think the difference in the amount of training required across representations comes from the fact that the point cloud and mesh models used the Chamfer distance as their loss function. This loss encourages predicted points (or points sampled from the predicted mesh) to be close to target points, which is very similar to what the F-1 score measures. Meanwhile, the voxel model used binary cross entropy as its loss function, which imposes additional constraints (for example, predict 1s inside the object, and do not predict 1s outside the object, even for voxels close to its surface).
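For reference, here is a minimal sketch of the symmetric Chamfer distance described above, written in plain PyTorch; the actual training code may rely on a library implementation, and the tensor shapes below are illustrative.

```python
import torch

def chamfer_distance(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer distance between batches of point clouds.

    pred:   (B, N, 3) predicted points (or points sampled from the predicted mesh)
    target: (B, M, 3) points sampled from the target mesh
    """
    # Pairwise squared Euclidean distances between all predicted and target points.
    dists = torch.cdist(pred, target) ** 2                 # (B, N, M)
    # Each predicted point is pulled towards its nearest target point...
    pred_to_target = dists.min(dim=2).values.mean(dim=1)   # (B,)
    # ...and each target point towards its nearest predicted point.
    target_to_pred = dists.min(dim=1).values.mean(dim=1)   # (B,)
    return (pred_to_target + target_to_pred).mean()
```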
When predicting a voxel grid from the encoder output, we have several options for the decoder architecture. The simplest is an MLP that predicts a tensor of size 32,768, which we then reshape to a 1x32x32x32 voxel grid. Alternatively, we can reshape the encoder output to a 1x8x8x8 voxel grid and use a series of transposed convolutions to upsample it to the desired resolution (1x32x32x32). Transposed convolutions have the advantage of a stronger inductive bias on the dependence between nearby voxels, which should allow them to learn better 3D representations with less training. However, they also require the encoder to learn a more structured latent representation of the input image.
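To make these two options concrete, here is a minimal PyTorch sketch. It assumes a 512-dimensional encoder output (consistent with the reshapes used in the experiments below); the MLP hidden width and the channel counts of the transposed convolutions are illustrative choices, not necessarily the ones I trained. The variants that reshape to 64x2x2x2 or 8x4x4x4 would simply need more upsampling stages.

```python
import torch
import torch.nn as nn

class MLPVoxelDecoder(nn.Module):
    """Option 1: an MLP that predicts all 32*32*32 = 32,768 occupancy logits at once."""
    def __init__(self, latent_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 2048), nn.ReLU(),
            nn.Linear(2048, 32 * 32 * 32),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, latent_dim) -> (B, 1, 32, 32, 32) occupancy logits
        return self.net(z).view(-1, 1, 32, 32, 32)

class ConvVoxelDecoder(nn.Module):
    """Option 2: reshape the 512-d latent to a 1x8x8x8 grid, then upsample 8 -> 16 -> 32."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose3d(1, 32, kernel_size=4, stride=2, padding=1),  # 8 -> 16
            nn.ReLU(),
            nn.ConvTranspose3d(32, 1, kernel_size=4, stride=2, padding=1),  # 16 -> 32
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, 512); since 1*8*8*8 = 512, the latent reshapes exactly into the coarse grid.
        return self.net(z.view(-1, 1, 8, 8, 8))
```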
My hypothesis was that the advantage of increasing the inductive bias in the decoder by using transposed convolutions would outweigh the disadvantage of requiring a less flexible latent representation from the encoder.
I trained 4 models to predict voxels from single images. The first used an MLP decoder, while the other three used transposed convolutional decoders. Each model applied a different reshaping of the encoder output before passing it to the decoder: the MLP kept it unchanged, while the other models reshaped it to 64x2x2x2, 8x4x4x4, and 1x8x8x8, respectively. These models were trained for 1-5K iterations with a batch size of 32 and were compared using the minimum training loss achieved. Using a validation F-1 score would have been ideal, but at this stage the trained models were still predicting empty meshes. It should be noted that the number of parameters is not the same across the networks: the model with the MLP decoder has around 84M parameters in total, while the models with transposed convolutional decoders have around 11M parameters.
The plot below shows the minimum training loss achieved by each architecture. The first circle corresponds to the MLP decoder model. In this experiment, the MLP decoder achieved a lower training loss than the transposed convolutional decoders, which suggests that the stronger inductive bias of the transposed convolutional decoders was not beneficial enough for voxel prediction from single images in this setting.
The model we have trained has the capacity to reconstruct 3D shapes from single images, including segments of the object that are not visible in the input image. This could mean that the model has learned to identify patterns in the image, or that it has simply memorized a set of shapes and is matching the input image to one of them. To investigate this, I ran the model on input views that are very different from those seen during training. If the model reconstructs a reasonable 3D shape from these images, it has likely learned to identify patterns in the image. If it instead produces a shape similar to one of the training shapes, it has likely just memorized a set of shapes.
| Input image | Predicted point cloud | Target mesh |
|---|---|---|
I replicated the network architecture from the paper "AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation". Based on their results, I decided to use 25 manifolds to represent each object. This means I trained 25 different MLPs, each taking as input the concatenation of the encoder output and a 2D point sampled from a square. Each MLP predicts a 3D point, and all predicted points are then joined to form the final point cloud predicted by the model.
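A minimal sketch of this decoder is below, assuming a 512-dimensional encoder output and 100 sampled 2D points per manifold; the per-MLP hidden sizes are simplified relative to the AtlasNet paper.

```python
import torch
import torch.nn as nn

class AtlasNetStyleDecoder(nn.Module):
    """Predicts a point cloud as the union of several deformed 2D patches."""

    def __init__(self, latent_dim: int = 512, num_patches: int = 25,
                 points_per_patch: int = 100):
        super().__init__()
        self.points_per_patch = points_per_patch
        # One small MLP per patch: input is (latent vector, 2D point sampled from a square).
        self.patch_mlps = nn.ModuleList([
            nn.Sequential(
                nn.Linear(latent_dim + 2, 256), nn.ReLU(),
                nn.Linear(256, 256), nn.ReLU(),
                nn.Linear(256, 3),
            )
            for _ in range(num_patches)
        ])

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, latent_dim) encoder output
        B = z.shape[0]
        points = []
        for mlp in self.patch_mlps:
            # Sample 2D points uniformly from the unit square for this patch.
            uv = torch.rand(B, self.points_per_patch, 2, device=z.device)
            z_rep = z.unsqueeze(1).expand(-1, self.points_per_patch, -1)
            points.append(mlp(torch.cat([z_rep, uv], dim=-1)))  # (B, P, 3)
        # Union of all patches: (B, num_patches * points_per_patch, 3)
        return torch.cat(points, dim=1)
```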
| Input image | Predicted point cloud | Target mesh |
|---|---|---|