16-825 Learning for 3D Vision
Rohan Nagabhirava
Optimized Voxel Grid
Ground Truth Voxel Grid
Loss Function: Binary Cross Entropy (BCE) Loss
We use a binary cross-entropy loss, which compares the predicted occupancy probability of each voxel against the ground-truth occupancy and penalizes the mismatch between the two distributions, averaged over every voxel in the grid.
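A minimal sketch of this loss, assuming the decoder outputs raw logits over a 32³ grid (the logits-vs-probabilities choice is an assumption, not stated above):

```python
import torch
import torch.nn.functional as F

def voxel_loss(voxel_pred, voxel_gt):
    # voxel_pred: (B, 32, 32, 32) raw occupancy logits from the decoder
    #             (assumption: logits rather than probabilities).
    # voxel_gt:   (B, 32, 32, 32) binary occupancy in {0, 1}.
    # BCE-with-logits compares predicted occupancy against the ground truth,
    # averaged over all voxels in the grid.
    return F.binary_cross_entropy_with_logits(voxel_pred, voxel_gt.float())
```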
Optimized Point Cloud
Ground Truth Point Cloud
Loss Function: Chamfer Distance
We implemented the Chamfer distance, which, for each predicted point, takes the minimum L2 distance to the ground-truth set, and for each ground-truth point the minimum L2 distance to the predicted set, and sums the two terms. In other words, the loss measures how far every point is from its nearest neighbor in the other point set, in both directions.
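A brute-force sketch of this loss using torch.cdist; one could equally compute the nearest neighbors with pytorch3d.ops.knn_points, but the idea is the same:

```python
import torch

def chamfer_loss(point_cloud_src, point_cloud_tgt):
    # point_cloud_src: (B, N, 3) predicted points
    # point_cloud_tgt: (B, M, 3) ground-truth points
    # Pairwise squared L2 distances between every predicted and ground-truth point.
    dists = torch.cdist(point_cloud_src, point_cloud_tgt) ** 2  # (B, N, M)
    # For each predicted point, distance to its nearest ground-truth point ...
    loss_src = dists.min(dim=2).values.mean()
    # ... and for each ground-truth point, distance to its nearest predicted point.
    loss_tgt = dists.min(dim=1).values.mean()
    return loss_src + loss_tgt
```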
Optimized Mesh
Ground Truth Mesh
Loss Function: Chamfer Distance + Smoothness Loss
We keep the chamfer_loss on the mesh vertices and add a smoothness_loss based on Laplacian smoothing. This term penalizes a vertex for being far from the average of its connected, adjacent vertices, so neighboring vertices are discouraged from differing sharply and the surface stays smooth.
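A sketch of the combined mesh loss; it uses PyTorch3D's built-in chamfer_distance and mesh_laplacian_smoothing for brevity (our own chamfer_loss from above could be substituted), and the smoothness weight shown is illustrative, not the value we tuned:

```python
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing

def mesh_loss(mesh_pred, points_gt, w_smooth=0.1):
    # Chamfer term between the predicted mesh vertices and the ground-truth points.
    # mesh_pred is a pytorch3d Meshes object; points_gt is (B, M, 3).
    loss_chamfer, _ = chamfer_distance(mesh_pred.verts_padded(), points_gt)
    # Laplacian smoothing: penalizes each vertex's offset from the average of
    # its connected neighbors, discouraging jagged local geometry.
    loss_smooth = mesh_laplacian_smoothing(mesh_pred, method="uniform")
    # w_smooth is an illustrative weight.
    return loss_chamfer + w_smooth * loss_smooth
```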
Input RGB
Predicted Voxel Grid
Ground Truth Voxel Grid
Input RGB
Predicted Voxel Grid
Ground Truth Voxel Grid
Input RGB
Predicted Voxel Grid
Ground Truth Voxel Grid
Decoder Architecture: The voxel decoder is a sequential network. It starts with a linear layer that expands the 512-dimensional latent vector to 4096, which is reshaped into a coarse 3D feature volume. This is followed by four 3D transposed convolution layers, each with a 3D batch norm and a ReLU after it.
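A sketch of this decoder. The channel widths and the 64 × 4³ reshape are assumptions for illustration; only the 512 → 4096 linear layer and the four ConvTranspose3d + BatchNorm3d + ReLU blocks are fixed by the description above:

```python
import torch
import torch.nn as nn

class VoxelDecoder(nn.Module):
    def __init__(self, latent_dim=512):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 4096)
        self.deconv = nn.Sequential(
            nn.ConvTranspose3d(64, 32, kernel_size=4, stride=2, padding=1),  # 4^3 -> 8^3
            nn.BatchNorm3d(32), nn.ReLU(),
            nn.ConvTranspose3d(32, 16, kernel_size=4, stride=2, padding=1),  # 8^3 -> 16^3
            nn.BatchNorm3d(16), nn.ReLU(),
            nn.ConvTranspose3d(16, 8, kernel_size=4, stride=2, padding=1),   # 16^3 -> 32^3
            nn.BatchNorm3d(8), nn.ReLU(),
            nn.ConvTranspose3d(8, 1, kernel_size=3, stride=1, padding=1),    # 32^3, 1 channel
            # The final BatchNorm/ReLU follows the written description; in
            # practice this last activation is often dropped so the output can
            # be treated as logits for the BCE loss.
            nn.BatchNorm3d(1), nn.ReLU(),
        )

    def forward(self, z):
        x = self.fc(z).view(-1, 64, 4, 4, 4)      # reshape to a 3D feature volume
        return self.deconv(x).squeeze(1)           # (B, 32, 32, 32) occupancy scores
```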
Input RGB
Predicted Point Cloud
Ground Truth Point Cloud
Input RGB
Predicted Point Cloud
Ground Truth Point Cloud
Input RGB
Predicted Point Cloud
Ground Truth Point Cloud
Decoder Architecture: For the point cloud decoder, we use an MLP with four linear layers. Each hidden layer doubles the output dimension of the previous one (512 to 1024 to 2048 to 4096), and the last layer maps to the total number of points multiplied by 3 for their xyz coordinates. There is a ReLU activation between consecutive layers.
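A minimal sketch of this MLP, assuming n_points (set to 5000 here purely for illustration) matches the number of points used during training:

```python
import torch
import torch.nn as nn

class PointCloudDecoder(nn.Module):
    def __init__(self, latent_dim=512, n_points=5000):
        super().__init__()
        self.n_points = n_points
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 2048), nn.ReLU(),
            nn.Linear(2048, 4096), nn.ReLU(),
            nn.Linear(4096, n_points * 3),   # no activation after the last layer
        )

    def forward(self, z):
        # Reshape the flat output into (B, N, 3) xyz coordinates.
        return self.mlp(z).view(-1, self.n_points, 3)
```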
Input RGB
Predicted Mesh
Ground Truth Mesh
Input RGB
Predicted Mesh
Ground Truth Mesh
Input RGB
Predicted Mesh
Ground Truth Mesh
Decoder Architecture: For the mesh decoder, we have four linear layers. The first increases the dimension from the 512-dimensional latent space to 1024. The middle two linear layers keep a constant 1024-dimensional space. The last linear layer outputs the number of vertices multiplied by 3, the xyz coordinates for all of them. There is a ReLU activation between consecutive layers, and the final output passes through a Tanh activation.
Mesh Initialization: We initialize the mesh using an icosphere with subdivision level 4.
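A sketch of the mesh decoder together with the icosphere initialization. How the per-vertex outputs are combined with the template is not fixed by the description above; treating them as offsets applied via Meshes.offset_verts is an assumption here:

```python
import torch
import torch.nn as nn
from pytorch3d.utils import ico_sphere

# Initial mesh: an icosphere at subdivision level 4, as stated above.
src_mesh = ico_sphere(level=4)
n_verts = src_mesh.verts_packed().shape[0]

class MeshDecoder(nn.Module):
    def __init__(self, latent_dim=512, n_verts=n_verts):
        super().__init__()
        self.n_verts = n_verts
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, n_verts * 3),
            nn.Tanh(),   # final Tanh keeps per-vertex outputs bounded
        )

    def forward(self, z):
        # (B, V, 3) per-vertex predictions; in this sketch they are assumed to
        # be added to the initial icosphere vertices as offsets.
        return self.mlp(z).view(-1, self.n_verts, 3)
```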
Voxel Grid F1 Scores
Point Cloud F1 Scores
Mesh F1 Scores
Across the three representations, the F1 curves follow a similar trend: as the threshold increases, all three achieve noticeably higher F1 scores, which is expected since a larger threshold tolerates a greater distance between predicted and ground-truth points.
The point cloud achieves the highest F1 scores overall. Looking at the graphs, it reaches almost 80% F1 at the largest threshold of 0.05, whereas voxel and mesh both top out around 75%. At the 0.03 threshold, the point cloud is highest at 53%, with mesh at 50% and voxel at 47%.
Why these differences? The voxel grid is a high-dimensional representation (a 32³ grid) that is harder to learn, resulting in lower F1 scores throughout the curve. Point clouds are the easiest to learn because the network only needs to predict xyz coordinates for each point. Mesh falls in between: while it also predicts vertex coordinates like a point cloud, it additionally has to respect the topology and structure of the faces, making it more constrained and slightly harder to optimize than free-form point clouds.
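For reference, a sketch of how the F1 score at a given distance threshold can be computed from sampled surface points (the exact sampling and threshold sweep used for the plots follow the course evaluation script; this is an illustrative re-implementation):

```python
import torch

def f1_score(points_pred, points_gt, threshold):
    # points_pred: (N, 3) predicted surface samples; points_gt: (M, 3) ground truth.
    dists = torch.cdist(points_pred, points_gt)                        # (N, M) L2 distances
    precision = (dists.min(dim=1).values < threshold).float().mean()   # pred -> gt
    recall = (dists.min(dim=0).values < threshold).float().mean()      # gt -> pred
    return 2 * precision * recall / (precision + recall + 1e-8)

# Sweeping the threshold (e.g. 0.01 to 0.05) produces the F1 curves shown above.
```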
We varied the initial mesh topology to analyze how different starting geometries affect reconstruction quality.
Baseline (ico_sphere level 4)
Variation 1 (torus initialization)
Variation 2 (ico_sphere level 3)
Input RGB
Baseline
Torus Init
Coarser Sphere
In this experiment, we explored hyperparameter variations of the initial mesh for mesh-based 3D inference. Our baseline was an icosphere of subdivision level 4. We compared two variations: (1) a torus-based initialization, and (2) a coarser sphere (icosphere of level 3).
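The three initializations can be constructed directly with PyTorch3D utilities; the torus radii and resolution shown here are illustrative, not the exact values we used:

```python
from pytorch3d.utils import ico_sphere, torus

# Baseline: icosphere at subdivision level 4.
mesh_baseline = ico_sphere(level=4)

# Variation 1: torus initialization (radii and ring/side counts are assumptions).
mesh_torus = torus(r=0.5, R=1.0, sides=32, rings=32)

# Variation 2: coarser icosphere at subdivision level 3 (fewer vertices and faces).
mesh_coarse = ico_sphere(level=3)
```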
Quantitative Results: Looking at the F1 scores, all three variations reach about the same values overall, and the curves are quite similar. However, at the 0.03 threshold, the coarser sphere performs slightly worse than both the torus and the baseline icosphere. This is because it has fewer faces than the torus initialization and the baseline icosphere, limiting its representational capacity.
Torus Initialization: The torus initialization is particularly interesting because it has a different topology than a sphere: it contains a hole in the center. This means it can potentially learn some shapes better, especially those with a similar topology. However, for genus-zero shapes with no holes (like most chairs), it may struggle to learn the correct topology and can introduce artifacts.
Qualitative Comparison: Looking at the same chair across the different initializations, we can see clear differences. The baseline captures the bottom legs well, showing the separation between each leg clearly. The torus gets fairly close - the legs and the overall chair shape are visible - but it was not able to recover fine details. The coarser sphere gives a noticeably worse result than the baseline sphere, which makes sense because there are fewer vertices and faces to deform, resulting in a coarser reconstruction.
Conclusion: The initial mesh is a crucial design choice. Topology matters: starting from a mesh whose topology matches the shapes being reconstructed helps considerably. While different initializations can converge to similar quantitative scores, the qualitative results reveal that topology and mesh resolution significantly impact the fine-grained geometric details of the reconstruction.
We analyze the learned weights of the point cloud and mesh decoders to understand how different 3D representations affect the network's learned parameters. By comparing weight distributions and layer-wise statistics, we can identify differences in learning strategies between the two decoder architectures.
Weight Distribution Histograms
Layer-wise Statistics
Similar Weight Distributions: The weight distributions of the two decoders are very similar. This makes sense: the architectures are alike and they solve closely related problems, so similar transformations are learned for predicting point cloud and mesh geometry. Both decoders converged to nearly identical Gaussian-like distributions centered around zero across all layers, indicating they learned comparable feature mappings.
Over-parameterization: Many weight values sit at or near zero, and sparsity is high throughout both networks. We may have over-parameterized these decoders and could have used a much simpler network to learn similar features. The high sparsity and concentration of weights near zero suggest that many parameters contribute minimally to the output, indicating the networks are likely larger than necessary for this task.
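A sketch of the kind of per-layer summary behind these plots; the near-zero tolerance used to measure sparsity is an illustrative choice:

```python
import torch

def layer_weight_stats(model, zero_tol=1e-3):
    # Collect per-layer weight statistics for the histogram / sparsity analysis.
    stats = {}
    for name, param in model.named_parameters():
        if "weight" not in name:
            continue
        w = param.detach().flatten()
        stats[name] = {
            "mean": w.mean().item(),
            "std": w.std().item(),
            # "Sparsity" here means the fraction of weights close to zero;
            # zero_tol is an arbitrary cutoff chosen for illustration.
            "near_zero_frac": (w.abs() < zero_tol).float().mean().item(),
        }
    return stats
```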
Model Type: Mesh-based reconstruction
Dataset: Extended dataset with 3 classes (chair, car, plane) using split_3c.json
Comparison: Single-class training (chair only) vs Multi-class training (chair, car, plane)
Single Class Training (Chair only)
Three Class Training (Chair, Car, Plane)
Comparing single-class specialist vs multi-class generalist on chair reconstruction
Input RGB
Single-Class Training
Multi-Class Training
Single-class model cannot generalize to planes
Input RGB
Single-Class Training
Multi-Class Training
Single-class model cannot generalize to cars
Input RGB
Single-Class Training
Multi-Class Training
F1 Score Comparison: Based on the F1 graphs, the three-class model clearly reaches a much higher F1 score. At a threshold of 0.05, the three-class model reaches approximately 85% F1, versus only around 75% for the single-class model. Notably, the multi-class model was trained for only 25,000 iterations compared to 100,000 for the single-class model, yet still achieved significantly better performance.
Qualitative Comparison - Chair: Looking at the chair comparison (the only object type the single-class model was trained on), the multi-class model's reconstruction shows noticeably more noise and is not as crisp and clean as the single-class model's. This demonstrates that specializing on a single class produces cleaner results for that specific object type.
Generalization Test - Plane: When evaluating the single-class model on the full dataset with a plane as input, it does much worse than the multi-class model, which produces a much cleaner result. This is expected: the multi-class model can generalize to the object types it has been trained on, while a model trained only on chairs generalizes poorly to objects it has never seen.
Generalization Test - Car: Looking at the car example, both the single-class and multi-class models struggle with this object. It appears to be a difficult shape to reconstruct, so it is hard to say which performed better; both produce quite poor outputs.
Conclusion: The plane example shows clearly that the multi-class model generalizes much better across classes, while the single-class model performs better on its specialized class. Multi-class training improves generalization but hurts specialization: there is a fundamental trade-off between training a specialist model that excels at one object type and a generalist model that handles diverse object categories with reasonable performance across all of them.