Question 1: Exploring Loss Functions
1.1: Fitting a Voxel Grid
Order: (1) GT (2) Prediction
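For reference, here is a minimal sketch of the voxel-fitting setup, assuming the standard approach of directly optimizing a source grid's logits against the target occupancies with binary cross-entropy (the grid size, learning rate, and iteration count are placeholders, not the exact values used):

```python
import torch

# Learnable source grid (logits) and a placeholder target occupancy grid;
# BCE pushes the sigmoid of each source cell toward the target's 0/1 occupancy.
src = torch.randn(1, 32, 32, 32, requires_grad=True)
gt = (torch.rand(1, 32, 32, 32) > 0.5).float()  # placeholder ground truth

opt = torch.optim.Adam([src], lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    loss = torch.nn.functional.binary_cross_entropy_with_logits(src, gt)
    loss.backward()
    opt.step()
```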
1.2: Fitting a Point Cloud
Order: (1) GT (2) Prediction
1.3: Fitting a Mesh
Order: (1) GT (2) Prediction
Question 2: Reconstructing 3D From Single View
Because of GPU issues, the following experiments were run on CPU with load_feat set to True.
2.1: Image to Voxel Grid
Order: (1) RGB (2) GT (3) Prediction
2.2: Image to Point Cloud
Order: (1) RGB (2) GT (3) Prediction
2.3: Image to Mesh
Order: (1) RGB (2) GT (3) Prediction
2.4: Quantitative Analysis
The above plots show that the F1 score increases with the threshold for all three representation types. This makes sense: a larger threshold tolerates a larger distance between a predicted point and its nearest ground-truth point before that point counts as a miss, so both precision and recall can only go up.
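A sketch of the thresholded F1 metric as I understand it (sampling counts and names here are illustrative): points are sampled from both surfaces, precision counts predicted points within the threshold of some ground-truth point, and recall counts the reverse.

```python
import torch
from pytorch3d.ops import knn_points

def f1_at_threshold(pred_pts: torch.Tensor, gt_pts: torch.Tensor, tau: float) -> torch.Tensor:
    # pred_pts, gt_pts: (1, N, 3) points sampled from the two surfaces.
    # knn_points returns squared distances to the single nearest neighbor.
    d_pred = knn_points(pred_pts, gt_pts, K=1).dists.sqrt()  # pred -> nearest GT
    d_gt = knn_points(gt_pts, pred_pts, K=1).dists.sqrt()    # GT -> nearest pred
    precision = (d_pred < tau).float().mean()
    recall = (d_gt < tau).float().mean()
    return 2 * precision * recall / (precision + recall + 1e-8)
```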
Among the three methods, voxels performed best. Even though occupancy grids are relatively low resolution, a 32x32x32 grid provides enough detail to extract a mesh that closely matches the ground truth. Furthermore, because the voxel grid is a fixed-size, regular representation, the network may have exploited its simplicity to better learn the image-to-3D mapping. Finally, the voxel network was trained with a batch size of 32, giving far less noisy gradient estimates, and hence more stable convergence, than the other two networks.
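As an aside, the mesh extraction mentioned above can be done in a single call; a sketch, assuming PyTorch3D's cubify is used (the occupancy tensor and threshold are placeholders):

```python
import torch
from pytorch3d.ops import cubify

voxels = torch.rand(1, 32, 32, 32)  # predicted occupancy probabilities (placeholder)
mesh = cubify(voxels, thresh=0.5)   # emits cube faces for every cell above the threshold
```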
The second-best method was meshes. As seen in the outputs above, the predicted meshes have very sharp faces and fail to capture fine details such as the gaps in the chair's back or the armrests. The sharp edges may indicate that the smoothness loss was weighted too low in the overall mesh objective. The network also may have struggled to learn the deformation because the initial mesh was an ico-sphere, which is far from the shape of a typical chair; learning to deform its vertices is much harder than deforming a shape closer to the final output, such as a cube mesh or even a generic chair mesh. Furthermore, the network was trained with a batch size of 2, which led to very noisy gradients throughout each epoch. I tried to control this noise with a batch-norm layer but still observed very noisy losses during training.
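A sketch of the weighted objective I have in mind, assuming a PyTorch3D-style combination of chamfer distance on sampled points and uniform Laplacian smoothing (w_smooth is the weight I suspect was too low; the function name and defaults are illustrative, not the exact code):

```python
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes

def mesh_loss(pred_mesh, gt_mesh, w_smooth: float = 0.1, n_points: int = 5000):
    pred_pts = sample_points_from_meshes(pred_mesh, n_points)
    gt_pts = sample_points_from_meshes(gt_mesh, n_points)
    loss_chamfer, _ = chamfer_distance(pred_pts, gt_pts)
    # A larger w_smooth penalizes bent edges more, which should soften
    # the sharp faces visible in the outputs above.
    loss_smooth = mesh_laplacian_smoothing(pred_mesh, method="uniform")
    return loss_chamfer + w_smooth * loss_smooth
```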
The worst method was point clouds. As seen in the outputs above, the predicted point clouds don't form well-defined shapes; the points are noisy and cluster toward the center of the object, losing fine-grained detail on the chair's legs and armrests. The main reason is batch size: due to the same computational constraints as in the mesh training, I had to use a batch size of 2, and the resulting noisy gradients made convergence much more difficult. But the batch size matched the mesh model's, and point clouds still did worse, so the chamfer loss itself is likely also to blame. It only asks that, on average, ground-truth points be close to predicted points and vice versa, so the model can score well by concentrating its points wherever the ground truth is densest, at the cost of sparse regions like the armrests and legs.
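Writing the symmetric chamfer loss out makes that failure mode concrete: both terms are means, so a dense cluster of predicted points near a dense ground-truth region keeps the loss low even when sparse regions are missed (a sketch; pytorch3d.loss.chamfer_distance implements the same idea).

```python
import torch
from pytorch3d.ops import knn_points

def chamfer_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    # pred: (B, N, 3), gt: (B, M, 3); dists are squared nearest-neighbor distances.
    d_pred_to_gt = knn_points(pred, gt, K=1).dists  # each predicted point -> nearest GT
    d_gt_to_pred = knn_points(gt, pred, K=1).dists  # each GT point -> nearest prediction
    # Both directions are averaged, so a few badly-covered GT points
    # (legs, armrests) barely move the mean.
    return d_pred_to_gt.mean() + d_gt_to_pred.mean()
```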
2.5: Analyze Effects of Hyperparameter Variation
Given my discussion in 2.4 about the disadvantages of the ico-sphere, I tried modifying the initial shape for mesh deformation, using a subdivided cube. Since a plain cube has only 8 vertices and 12 faces, I used PyTorch3D's SubdivideMeshes() to increase the vertex count, giving the model more vertices to deform and hence more representational power. A sketch of this setup follows.
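This is my reconstruction of that initialization; the vertex coordinates and the two-pass subdivision are inferred from the 98-vertex / 192-face counts reported below, not copied from the actual code.

```python
import torch
from pytorch3d.structures import Meshes
from pytorch3d.ops import SubdivideMeshes

# Unit cube: 8 vertices, 12 triangular faces.
verts = torch.tensor([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0],
                      [0, 0, 1], [1, 0, 1], [1, 1, 1], [0, 1, 1]], dtype=torch.float32)
faces = torch.tensor([[0, 2, 1], [0, 3, 2], [4, 5, 6], [4, 6, 7],
                      [0, 1, 5], [0, 5, 4], [1, 2, 6], [1, 6, 5],
                      [2, 3, 7], [2, 7, 6], [3, 0, 4], [3, 4, 7]], dtype=torch.int64)
cube = Meshes(verts=[verts], faces=[faces])

# Each subdivision pass splits every face into four:
# 12 -> 48 -> 192 faces, 8 -> 26 -> 98 vertices.
for _ in range(2):
    cube = SubdivideMeshes()(cube)
```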
Order: (1) Ico-sphere initial mesh (2) Cube mesh
In the F1 plots and visualizations above, we see that the initial cube mesh performs much worse than the ico-sphere. The main reason is representational power: the ico-sphere has 2562 vertices and 5120 faces, while the subdivided cube has only 98 vertices and 192 faces, so the model simply doesn't have enough vertices to deform into an output close to the ground truth. Even though the cube is intuitively the better initial shape, more vertices need to be added before the model can deform it well. With more time, I would have subdivided the cube further so that its vertex and face counts approach the ico-sphere's; since each pass roughly quadruples the face count, two more passes (192 -> 768 -> 3072 faces, 1538 vertices) would get close.
2.6: Interpret Your Model
For my cube-mesh model, I used a standard architecture of Linear layer, BatchNorm, and LeakyReLU activation as one block, repeated several times and finished with a final Linear layer. To better understand how the deformation evolves, I captured the output at the end of each block as well as the output of the final Linear layer; a sketch of this probe follows.
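This is a hedged reconstruction (layer widths, block count, and names are illustrative, not the exact model). Each block's activation has dimension n_verts * 3, so it can be read directly as a set of vertex offsets and rendered.

```python
import torch
import torch.nn as nn

def block(d_in: int, d_out: int) -> nn.Sequential:
    # One block: Linear -> BatchNorm -> LeakyReLU.
    return nn.Sequential(nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out), nn.LeakyReLU())

class MeshDecoder(nn.Module):
    def __init__(self, feat_dim: int, n_verts: int, n_blocks: int = 4):
        super().__init__()
        d = n_verts * 3
        self.blocks = nn.ModuleList(
            [block(feat_dim, d)] + [block(d, d) for _ in range(n_blocks - 1)])
        self.head = nn.Linear(d, d)

    def forward(self, feat: torch.Tensor):
        probes = []  # activation after each block, viewable as vertex offsets
        x = feat
        for blk in self.blocks:
            x = blk(x)
            probes.append(x)
        x = self.head(x)
        probes.append(x)  # final offsets
        return x, probes  # five outputs in total, matching the five images below
```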
Below, the ground truth is shown on the left; the next five images are the model's outputs after each of the four blocks and after the final layer.
Above, we can see the model gradually transforming the initial mesh to look more and more like the ground truth. Over the first three blocks, the mesh becomes progressively less random; by the fourth block it shrinks and its shape becomes more cube-like; and at the final layer the model deforms the vertices into something resembling the ground-truth mesh.
Question 3: Extended Dataset for Training
3.3: Extended Dataset for Training
Left: Partial dataset, Right: Full dataset
Looking at the F1 score, the model trained on the partial dataset scores roughly twice as high as the model trained on the full dataset. In the visualizations, the partial-dataset model captures most of the input image's details, while the full-dataset model fails to capture the chair's legs. These results suggest that training on multiple classes pushes the model toward generic structure, such as the basic body of the shape (e.g., the back of the chair); for example, both of the full-dataset model's meshes have very similar backrests even though the second chair is much wider. This focus on generic structure also left the model unable to capture class-specific details like the legs and armrests. A likely explanation is capacity: both models have exactly the same representational power, but one must spread it across several shape categories while the other specializes in a single one, producing a tradeoff between multi-class generality and single-class detail.