Source first, then target

Source first, then target

Source first, then target

All model training and evaluation in this assignment uses the load_feat argument. The images below are in order of: input image, ground truth, prediction.



The images are in order of: input image, ground truth, prediction.



The images are in order of: input image, ground truth, prediction.



From the graphs below, we can see that all three models follow a similar F1-Score pattern as the threshold varies: each starts with a low F1-Score at threshold 0.01 and steadily increases, reaching its highest F1-Score at a threshold of 0.05.
The best-performing model is the pointcloud predictor, with an F1-Score of 77. The next best is the voxel predictor, with an F1-Score of 74, and the mesh predictor follows with an F1-Score of 70.
This ordering likely reflects each representation's strengths and limitations: the point cloud is unconstrained, so the network only needs to place points near the surface; the voxel grid is limited by its resolution, which blurs fine structures; and the mesh deforms a fixed-topology template, which restricts the shapes it can represent.
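For reference, here is a minimal sketch of how an F1-Score at a given threshold can be computed between sampled predicted and ground-truth surface points. The tensor names are illustrative, not the exact assignment code:

```python
import torch

def fscore(pred_points: torch.Tensor, gt_points: torch.Tensor, threshold: float) -> float:
    """F1-Score between two point sets at a distance threshold (scaled to 0-100)."""
    # Pairwise Euclidean distances between predicted and GT samples: (N, M).
    dists = torch.cdist(pred_points, gt_points)
    # Precision: fraction of predicted points within `threshold` of some GT point.
    precision = (dists.min(dim=1).values < threshold).float().mean()
    # Recall: fraction of GT points within `threshold` of some predicted point.
    recall = (dists.min(dim=0).values < threshold).float().mean()
    return (2 * precision * recall / (precision + recall + 1e-8)).item() * 100

# Sweeping the same thresholds as the curves below:
# for t in [0.01, 0.02, 0.03, 0.04, 0.05]:
#     print(t, fscore(pred_points, gt_points, t))
```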
Voxel F1-Score Curve
Pointcloud F1-Score Curve
Mesh F1-Score Curve

I changed the number of points predicted by the point cloud predictor. Below are the F1-Score curves and qualitative results.
Analysis: Using fewer points makes the model more time- and memory-efficient, but the predicted shapes tend to be coarse and miss fine details such as thin structures or edges, leading to higher reconstruction error. On the other hand, predicting more points provides denser coverage of the object surface, allowing the model to capture small geometric details and achieve lower error metrics, though this comes at the cost of higher computational and memory demands. This trade-off is visible in both the quantitative and qualitative results below.
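As a hypothetical illustration of where the point count enters, the sketch below ties the decoder's final linear layer to n_points, so its parameter count and output size grow with the number of predicted points. The layer sizes and names are assumptions, not the exact assignment architecture:

```python
import torch
import torch.nn as nn

class PointDecoder(nn.Module):
    """Toy decoder: image feature vector -> (n_points, 3) point cloud."""
    def __init__(self, feat_dim: int = 512, n_points: int = 1000):
        super().__init__()
        self.n_points = n_points
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, n_points * 3),  # output size grows linearly with n_points
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, feat_dim) image features -> (B, n_points, 3) coordinates
        return self.mlp(feat).reshape(-1, self.n_points, 3)
```

Swapping n_points between 500, 1000, 3000, and 5000 only changes the last layer, but both its parameters and the pairwise distances inside the Chamfer loss scale with the point count, which drives the time/memory trade-off described above.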
500 points
1000 points
3000 points
5000 points

500 points result
1000 points result
3000 points result
5000 points result

I converted the ground-truth meshes into voxel grids and compared predicted and ground-truth voxels by calculating the distance from each predicted voxel to its closest ground-truth voxel surface. I then colored each predicted voxel based on its distance from the mesh surface, highlighting where predictions diverged most from the ground-truth surface. From the qualitative results, we can see that our model is able to accurately predict the details of the top of the chair, but it loses geometric precision toward the bottom of the chair. This may be because many of the input images were taken at high elevation angles: 3D reconstruction performs well in areas that are captured well by the training images and poorly in areas that are not covered thoroughly.
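A rough sketch of this error-map computation, assuming boolean occupancy grids and a nearest-neighbor lookup via a KD-tree (the array names and random placeholder grids are illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree
import matplotlib.cm as cm

# `pred_voxels` / `gt_voxels` stand in for the (D, D, D) boolean occupancy
# grids from the experiment; random placeholders are used here.
D = 32
rng = np.random.default_rng(0)
pred_voxels = rng.random((D, D, D)) > 0.97
gt_voxels = rng.random((D, D, D)) > 0.97

pred_pts = np.argwhere(pred_voxels)   # (N, 3) occupied predicted voxel coords
gt_pts = np.argwhere(gt_voxels)       # (M, 3) occupied GT voxel coords

# Distance from each predicted voxel to its nearest occupied GT voxel.
dists, _ = cKDTree(gt_pts).query(pred_pts)

# Normalize and map through a colormap so the largest deviations stand out;
# colors[i] is then used to paint predicted voxel pred_pts[i] in the render.
norm = dists / (dists.max() + 1e-8)
colors = (cm.viridis(norm)[:, :3] * 255).astype(np.uint8)
```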
Here are the qualitative results, in order: GT mesh, predicted mesh, error map.



I ran an experiment training my voxel model on chairs, cars, and planes. It was trained with the same number of iterations, learning rate, model architecture, and all other hyperparameters as the Q2.1 training scheme; the only difference was that it was trained on the full dataset and tested on the mini, chair-only dataset.
The quantitative and qualitative results on the test set show that the Q2.1 model performs better. This is likely because the Q2.1 model's training data had the same distribution as its test data, whereas the Q3.3 model was also trained on other object categories such as cars and planes, so its training and test distributions differed substantially. Increasing the model's capacity for the Q3.3 experiment, including more chair data in the full dataset, or tuning other hyperparameters could reduce this performance gap.
From the quantitative results, we can see that Q3.3 had a much lower F-Score than the Q2.1 model at all thresholds. The qualitative results also show that Q3.3 produces lower-quality reconstructions than the Q2.1 model.
Q2.1 Quantitative Results
Q3.3 Quantitative Results

Qualitative Results, in order: GT mesh --> Q2.1 Result --> Q3.3 Result


