Assignment 2 - Single View to 3D
Author: Kailash Jagadeesh
Course: 16-825 Learning for 3D Vision — Carnegie Mellon University
Overview
In this assignment we implement and evaluate a Single-View to 3D Reconstruction network that predicts a full 3D shape (voxel grid, point cloud, or mesh) from a single RGB image.
The model follows an encoder–decoder structure based on a ResNet-18 backbone and modality-specific decoders.
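For reference, a minimal sketch of this setup, assuming a torchvision ResNet-18 encoder with its classification head removed and a hypothetical fully-connected point-cloud head (layer sizes here are illustrative, not the exact implementation):

```python
import torch
import torch.nn as nn
from torchvision import models

class SingleViewTo3D(nn.Module):
    """Encoder-decoder sketch: ResNet-18 features -> point coordinates."""
    def __init__(self, n_points=1000):
        super().__init__()
        resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        # Drop the classification head; keep the 512-d global feature.
        self.encoder = nn.Sequential(*list(resnet.children())[:-1])
        # Hypothetical point-cloud head: 512 -> n_points * 3 coordinates.
        self.decoder = nn.Sequential(
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, n_points * 3),
        )
        self.n_points = n_points

    def forward(self, images):                  # images: (B, 3, H, W)
        feat = self.encoder(images).flatten(1)  # (B, 512)
        return self.decoder(feat).view(-1, self.n_points, 3)
```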
Section 1: Exploring Loss Functions
Section 1.1: Fitting a Voxel Grid
| Predicted Voxel | Ground Truth |
| --- | --- |
|  |  |
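Fitting a voxel grid is typically driven by a per-voxel occupancy objective; a minimal sketch, assuming the decoder outputs raw logits and the ground truth is a binary grid, using standard binary cross-entropy:

```python
import torch
import torch.nn.functional as F

def voxel_loss(pred_logits, gt_voxels):
    """Mean binary cross-entropy between predicted occupancy logits
    and a {0, 1} ground-truth grid, e.g. both of shape (B, 32, 32, 32)."""
    return F.binary_cross_entropy_with_logits(pred_logits, gt_voxels.float())
```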
Section 1.2: Fitting a Point Cloud
| Predicted Point Cloud | Ground Truth |
| --- | --- |
|  |  |
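Point-cloud fitting is typically driven by the symmetric Chamfer distance; a minimal sketch using brute-force pairwise distances (a k-NN implementation would be preferable for large clouds):

```python
import torch

def chamfer_loss(pred, gt):
    """Symmetric Chamfer distance between point sets pred (B, N, 3)
    and gt (B, M, 3), using squared nearest-neighbor distances."""
    d = torch.cdist(pred, gt)  # (B, N, M) pairwise Euclidean distances
    pred_to_gt = (d.min(dim=2).values ** 2).mean()  # pred point -> nearest gt
    gt_to_pred = (d.min(dim=1).values ** 2).mean()  # gt point -> nearest pred
    return pred_to_gt + gt_to_pred
```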
Section 1.3: Fitting a Mesh
| Predicted Mesh | Ground Truth |
| --- | --- |
|  |  |
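Mesh fitting usually combines a Chamfer term on points sampled from the predicted mesh with a smoothness regularizer; a sketch using PyTorch3D, where the weight `w_smooth` is an illustrative guess rather than the exact value used here:

```python
from pytorch3d.ops import sample_points_from_meshes
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing

def mesh_loss(pred_mesh, gt_points, w_smooth=0.1, n_samples=5000):
    """Chamfer on points sampled from the predicted mesh, plus a uniform
    Laplacian smoothness term; w_smooth is an illustrative weight."""
    pred_points = sample_points_from_meshes(pred_mesh, num_samples=n_samples)
    loss_chamfer, _ = chamfer_distance(pred_points, gt_points)
    loss_smooth = mesh_laplacian_smoothing(pred_mesh, method="uniform")
    return loss_chamfer + w_smooth * loss_smooth
```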
Section 2: Reconstructing 3D from a Single View
Section 2.1: Image to Voxel Grid
F1@0.05 Score: ~66
| Input Image | Ground Truth | Predicted Voxel Grid |
| --- | --- | --- |
|  |  |  |
|  |  |  |
|  |  |  |
Section 2.2: Image to Point Cloud
F1@0.05 Score: ~82 (n_points=1k)
| Input Image | Ground Truth | Predicted Point Cloud |
| --- | --- | --- |
|  |  |  |
|  |  |  |
|  |  |  |
Section 2.3: Image to Mesh
F1@0.05 Score: ~72
| Input Image | Ground Truth | Predicted Mesh |
| --- | --- | --- |
|  |  |  |
|  |  |  |
|  |  |  |
Section 2.4: Quantitative Comparison
F1 Score Table
| Voxel Grid | Point Cloud | Mesh |
| --- | --- | --- |
| ~66 | ~82 | ~72 |
The point cloud model gives the best F1 score because it represents only the surface of the object, not the empty space around it. This lets it capture fine details and shapes much more accurately, without being limited by a fixed grid or structure. On the other hand, the voxel model works with a coarse 32×32×32 cube, where most of the volume is empty. That makes it hard to represent thin parts or small gaps, so the reconstructed shapes look blocky and less precise.
The mesh model does a bit better than voxels because it creates smooth, continuous surfaces, but it’s restricted by its fixed template (like a deformed sphere). It struggles with objects that have holes or separate parts. Overall, the point cloud gives a cleaner and more flexible representation of 3D shapes, which is why it scores higher.
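For context, a hedged sketch of how an F1@0.05 score of this kind is commonly computed for a single pair of point clouds (the exact evaluation code may differ): precision counts predicted points within the threshold of the ground truth, recall counts ground-truth points within the threshold of the prediction.

```python
import torch

def f1_at_threshold(pred, gt, tau=0.05):
    """F1 (in %) at distance threshold tau for point clouds
    pred (N, 3) and gt (M, 3) in the same coordinate scale."""
    d = torch.cdist(pred, gt)  # (N, M) pairwise distances
    precision = (d.min(dim=1).values < tau).float().mean() * 100
    recall = (d.min(dim=0).values < tau).float().mean() * 100
    return 2 * precision * recall / (precision + recall + 1e-8)
```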
Section 2.5: Analysis of Hyperparameters
| Input Image | n_points = 1k | n_points = 3k | n_points = 10k |
| --- | --- | --- | --- |
|  |  |  |  |
|  |  |  |  |
|  |  |  |  |
When varying the number of points in the point-cloud decoder (1k, 3k, and 10k), the quality of reconstruction improved noticeably as the number of points increased. With only 1k points, the predicted shapes captured the overall structure but missed finer surface details and had gaps in thin regions. At 3k points, the reconstructions became denser and more complete, balancing accuracy and training stability. At 10k points, the shapes were the most detailed but also slightly noisier and slower to train, since the network had to predict many more coordinates. Overall, increasing the number of points gives higher geometric fidelity up to a point, after which the gain in detail comes at the cost of longer training time and potential overfitting to small surface noise.
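A small sketch of how the supervision density scales with n_points, using an icosphere as a stand-in ground-truth mesh (illustrative only; the real setup also grows the decoder's output layer to n_points * 3):

```python
from pytorch3d.utils import ico_sphere
from pytorch3d.ops import sample_points_from_meshes

gt_mesh = ico_sphere(level=2)  # stand-in for a real ground-truth mesh

for n_points in (1000, 3000, 10000):
    # Denser supervision (and a larger decoder head) at higher settings.
    gt_points = sample_points_from_meshes(gt_mesh, num_samples=n_points)
    print(n_points, gt_points.shape)  # (1, n_points, 3)
```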
F1 Score Table
| n_points = 1k | n_points = 3k | n_points = 10k |
| --- | --- | --- |
| ~82 |  |  |
Section 2.6: Interpret the Model
To better understand how the voxel model represents object geometry, I varied the isovalue used for visualizing the predicted voxel grids. Lower isovalues (like 0.1–0.3) made the reconstructed shapes appear thicker and more filled in, since more voxels were considered occupied. Higher isovalues (0.6–0.9) produced thinner or even incomplete shapes, highlighting only the most confident regions predicted by the model. This variation helped reveal how the network encodes uncertainty in occupancy — lower-confidence areas correspond to softer voxel activations around surface boundaries. Adjusting the isovalue essentially changes the threshold for what the model “believes” is solid, giving insight into how confident and precise its 3D predictions are.
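A sketch of such an isovalue sweep using PyTorch3D's `cubify`, with random occupancies standing in for the sigmoid outputs of the actual model:

```python
import torch
from pytorch3d.ops import cubify

# Random occupancies in [0, 1] stand in for the model's sigmoid outputs.
pred_voxels = torch.rand(1, 32, 32, 32)

for i in range(1, 10):
    iso = 0.1 * i
    mesh = cubify(pred_voxels, thresh=iso)  # voxels above iso become solid
    print(f"iso={iso:.1f}: {mesh.verts_packed().shape[0]} vertices")
```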
| Input Image | Iso = 0.1 | Iso = 0.2 | Iso = 0.3 | Iso = 0.4 | Iso = 0.5 | Iso = 0.6 | Iso = 0.7 | Iso = 0.8 | Iso = 0.9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  |  |  |  |  |  |  |  |  |  |
Section 3: Exploring Other Architectures / Datasets
Section 3.3: Extended dataset for training
When the model trained on a single class (chair) was used for inference on all three object categories, it failed to generalize beyond the training distribution. The network overfitted to the chair class and produced similar chair-like reconstructions regardless of the actual input category. In contrast, the model trained jointly on all three classes showed a clear improvement in both qualitative predictions and quantitative metrics. It captured distinct structural cues from each object type, which is reflected in the higher F1 scores across categories, indicating better class discrimination and more balanced reconstruction performance.
Trained on Single Class:
| Input Image | Ground Truth | Predicted Point Cloud |
| --- | --- | --- |
|  |  |  |
|  |  |  |
|  |  |  |
Trained on Multi-Class:
| Input Image | Ground Truth | Predicted Point Cloud |
| --- | --- | --- |
|  |  |  |
|  |  |  |
|  |  |  |