Assignment 2: Single View to 3D
1. Exploring loss functions
1.1. Fitting a voxel grid
| GroundTruth Voxel Grid | Target Voxel Grid |
| --- | --- |
|  |  |
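Fitting the voxel grid is driven by a per-voxel binary cross-entropy between the predicted occupancies and the binary target grid. Below is a minimal NumPy sketch of that objective, illustrative only: the function name and the logit parameterization are assumptions, and the actual training code would use the framework's batched, differentiable version.

```python
import numpy as np

def voxel_bce_loss(pred_logits, target_occ):
    """Mean binary cross-entropy between predicted occupancy logits and a
    binary target grid (illustrative sketch, not the training implementation).

    pred_logits: (D, H, W) raw scores; target_occ: (D, H, W) in {0, 1}.
    """
    # Sigmoid turns raw logits into occupancy probabilities in (0, 1).
    p = 1.0 / (1.0 + np.exp(-pred_logits))
    eps = 1e-7  # clip to avoid log(0)
    p = np.clip(p, eps, 1.0 - eps)
    return float(np.mean(-(target_occ * np.log(p)
                           + (1.0 - target_occ) * np.log(1.0 - p))))
```

A correct prediction (strongly negative logits for empty voxels) drives the loss toward zero, while confidently wrong logits are penalized heavily.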
1.2. Fitting a point cloud
| GroundTruth Point Cloud | Target Point Cloud |
| --- | --- |
|  |  |
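Point cloud fitting minimizes the symmetric chamfer distance between the predicted and target clouds. An illustrative NumPy sketch of the metric (the real implementation would be a batched, differentiable version):

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric chamfer distance: mean squared distance from each point
    in `a` to its nearest neighbour in `b`, plus the reverse direction.
    a: (N, 3), b: (M, 3). Illustrative sketch only.
    """
    # Pairwise squared distances, shape (N, M).
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())
```

Because each point is matched only to its nearest neighbour, no connectivity is assumed, which is exactly the flexibility discussed in the comparison below.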
1.3. Fitting a mesh
| GroundTruth Mesh | Target Mesh |
| --- | --- |
|  |  |
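Mesh fitting adds a smoothness regularizer on top of a point-based fidelity term. A uniform-Laplacian smoothing sketch in NumPy (illustrative only; the edge-list representation and uniform weighting are assumptions, and the training code would use the framework's differentiable op):

```python
import numpy as np

def laplacian_smoothing_loss(verts, edges):
    """Uniform Laplacian smoothing: penalise each vertex's offset from the
    mean of its neighbours. verts: (V, 3); edges: iterable of (i, j) pairs.
    Illustrative sketch only.
    """
    neighbour_sum = np.zeros_like(verts)
    degree = np.zeros(verts.shape[0])
    for i, j in edges:
        # Accumulate each edge's contribution to both endpoints.
        neighbour_sum[i] += verts[j]
        neighbour_sum[j] += verts[i]
        degree[i] += 1
        degree[j] += 1
    # Vertex minus the centroid of its neighbours; zero for locally flat regions.
    lap = verts - neighbour_sum / np.maximum(degree, 1)[:, None]
    return float(np.mean(np.linalg.norm(lap, axis=1)))
```

Interior vertices that already sit at the centroid of their neighbours contribute nothing, so the term mostly pulls on jagged or isolated regions.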
2. Reconstructing 3D from single view
2.1. Image to voxel grid
| RGB Image | GroundTruth Mesh | GroundTruth Voxel Grid | Predicted Voxel Grid |
| --- | --- | --- | --- |
|  |  |  |  |
|  |  |  |  |
|  |  |  |  |
2.2. Image to point cloud
| RGB Image | GroundTruth Mesh | GroundTruth Point Cloud | Predicted Point Cloud |
| --- | --- | --- | --- |
|  |  |  |  |
|  |  |  |  |
|  |  |  |  |
2.3. Image to mesh
| RGB Image | GroundTruth Mesh | Predicted Mesh |
| --- | --- | --- |
|  |  |  |
|  |  |  |
|  |  |  |
2.4. Quantitative comparisons
Voxels
The voxel model achieved the lowest F1 score, mainly due to its coarse resolution (32×32×32), which limits its ability to capture fine geometric detail. Although 3D convolutions provide good spatial reasoning, the block-based representation and fixed grid size restrict reconstruction precision. As a result, despite appearing visually reasonable, the voxel outputs remain less accurate than the point cloud and mesh models.
Point Clouds
The point cloud model achieved the highest F1 score, both quantitatively and visually, due to its flexibility and simplicity. Unlike meshes, point clouds do not enforce vertex connectivity, allowing each point to be optimized independently. This independence makes the model easier to train, less sensitive to local errors, and more capable of capturing fine geometric details. The lack of topological constraints also enables faster convergence and better alignment with the ground truth. Overall, the point cloud representation strikes the best balance between accuracy, efficiency, and generalization.
Meshes
The mesh model achieved moderate performance, better than voxels but worse than point clouds. Its structured vertex connectivity provides geometric consistency, but the fixed icosphere topology limits its ability to represent complex shapes with holes or non-manifold structures. Predicting accurate face orientations and connectivity from single-view images is inherently difficult, further reducing reconstruction accuracy. Additionally, the mismatch between sampled points and mesh faces makes the loss less informative. While meshes offer smoother and more structured surfaces, their topological constraints prevent them from matching the flexibility and precision of point clouds.
In conclusion, the quantitative results clearly show that the point cloud model achieves the highest F1 score, followed by the mesh model, with the voxel model performing the lowest. This trend aligns with the inherent strengths of each representation: point clouds offer flexibility and high-resolution detail without connectivity constraints; meshes, though structurally consistent, are limited by a fixed topology and the difficulty of predicting connectivity; and voxels suffer from coarse spatial resolution, which restricts fine-detail capture despite the strong spatial reasoning of 3D convolutions.
Overall, the F1 scores reflect that point clouds provide the most effective and adaptable representation for 3D reconstruction tasks.
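The F1 scores above are computed by matching points sampled from the prediction and the ground truth at a fixed distance threshold. An illustrative NumPy sketch of that metric (the default threshold value here is an assumption, not the assignment's setting):

```python
import numpy as np

def f1_score(pred, gt, threshold=0.01):
    """F1 for point-based reconstruction comparison (illustrative sketch).

    A predicted point counts toward precision if it lies within `threshold`
    of some ground-truth point; recall is the symmetric direction.
    pred: (N, 3), gt: (M, 3).
    """
    d2 = np.sum((pred[:, None, :] - gt[None, :, :]) ** 2, axis=-1)
    d_pred_to_gt = np.sqrt(d2.min(axis=1))  # nearest gt point per prediction
    d_gt_to_pred = np.sqrt(d2.min(axis=0))  # nearest prediction per gt point
    precision = np.mean(d_pred_to_gt < threshold)
    recall = np.mean(d_gt_to_pred < threshold)
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))
```

Voxel and mesh outputs are scored the same way by first sampling points from their surfaces, which is why all three representations are comparable under one number.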
2.5. Analyse the effects of hyperparameter variations
The predicted mesh surfaces often appeared jagged or disconnected, so I decided to explore the role of w_smooth in controlling surface continuity. By varying w_smooth, I wanted to see how aggressively the model would smooth out surfaces versus preserving fine geometric details.
Low values tended to leave surfaces noisy and fragmented, while higher values produced smoother, more cohesive meshes but sometimes over-smoothed delicate structures like thin chair legs or intricate patterns.
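The trade-off described above follows directly from how w_smooth weights the smoothness term against the fidelity term in the total objective. A minimal sketch (the function and parameter names other than w_smooth are assumptions):

```python
def total_mesh_loss(loss_chamfer, loss_smooth, w_chamfer=1.0, w_smooth=0.1):
    """Weighted mesh objective (illustrative sketch): w_smooth trades
    fidelity (chamfer term) against surface continuity (smoothing term).
    Larger w_smooth yields smoother but potentially over-smoothed meshes."""
    return w_chamfer * loss_chamfer + w_smooth * loss_smooth
```

With a small w_smooth the chamfer term dominates and surfaces stay noisy; raising it makes the optimizer favour locally flat geometry, which is what erodes thin chair legs.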
2.6. Interpret your model
1. Point Cloud Limitations in Capturing Hollow Structures
While training and inspecting the results, I noticed that point clouds struggle to capture the exact structure of objects, especially where there are holes or thin parts, like the gaps in a chair's backrest.
The point cloud often has very few or no points in those empty regions, so the model does not get enough information to learn the actual shape. Since point clouds only indicate where points exist, not where they don't, the model ends up guessing and often "fills in" the holes.
On top of that, networks for point clouds usually aggregate features across the whole shape, which can smooth out fine details like gaps. Sparse sampling, noise, and the limited number of points make it even harder for the model to capture hollow or thin structures. That’s why, in my results, the predicted chair backs often look solid instead of showing the slatted gaps.
| GroundTruth Mesh | GroundTruth Point Cloud | Predicted Point Cloud |
| --- | --- | --- |
|  |  |  |
|  |  |  |
|  |  |  |
2. Meshes Struggle with Thin Structures
Further, I realized that meshes often fail to capture very thin parts of objects, like the legs of a chair, and instead produce thicker, bulkier versions.
This happens because meshes represent surfaces as connected triangles, and creating very thin triangles that accurately follow delicate structures is difficult, especially when the underlying data is sparse or noisy.
The network generating the mesh has to interpolate between points, and small errors get amplified, causing the thin regions to “inflate.” Additionally, regularization techniques (like smoothing or minimizing surface energy) used during mesh reconstruction tend to favor smoother, thicker surfaces over extremely narrow ones.
Limited resolution and a finite number of vertices further constrain how much detail the mesh can encode, so intricate designs, such as carvings or structural details on chair legs, are lost entirely in the predicted mesh.
| GroundTruth Mesh | Predicted Mesh |
| --- | --- |
|  |  |
|  |  |
|  |  |
3. Exploring other architectures / datasets.
3.3 Extended dataset for training
| GroundTruth Mesh | GroundTruth Point Cloud | Predicted Point Cloud |
| --- | --- | --- |
|  |  |  |
|  |  |  |
When trained on a single class, such as chairs, the model specializes in that class: it focuses on the unique shapes, geometry, and structural patterns of chairs, which helps considerably. As a result, the F1 score for chair reconstruction (shown in the left-side graph) tends to be higher.
On the other hand, when I trained the model on the extended dataset with multiple classes (chairs, cars, and planes), it had to learn features that work across all of them. For chairs specifically, the F1 score dropped slightly (shown in the right-side graph). This makes sense because the network's capacity is now shared across several classes, so it cannot focus as much on the fine details of a single class.
| Training on One Class | Training on Three Classes |
| --- | --- |
|  |  |