Assignment 2 - Single View to 3D

Author: Kailash Jagadeesh

Course: 16-825 Learning for 3D Vision — Carnegie Mellon University

Overview

In this assignment we implement and evaluate a Single-View to 3D Reconstruction network that predicts a full 3D shape (voxel grid, point cloud, or mesh) from a single RGB image.
The model follows an encoder–decoder structure based on a ResNet-18 backbone and modality-specific decoders.
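The sketch below shows one way to set up the ResNet-18 encoder with torchvision, dropping the classification head so the pooled 512-dimensional feature can be passed to a decoder. It is a minimal illustration of the architecture described above, not code copied from the implementation.

```python
import torch.nn as nn
from torchvision import models

class SingleViewEncoder(nn.Module):
    """ResNet-18 backbone: maps a B x 3 x H x W image to a B x 512 feature vector."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)  # pretrained weights can be loaded here instead
        # Keep everything up to (and including) global average pooling, drop the fc classifier.
        self.features = nn.Sequential(*list(backbone.children())[:-1])

    def forward(self, images):
        feats = self.features(images)   # B x 512 x 1 x 1
        return feats.flatten(1)         # B x 512, consumed by a modality-specific decoder
```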


Section 1: Exploring Loss Functions

Section 1.1: Fitting a Voxel Grid

Renders: predicted voxel grid vs. ground truth.
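Fitting a voxel grid comes down to a per-voxel occupancy objective. Below is a minimal sketch assuming the usual binary cross-entropy between raw (pre-sigmoid) voxel logits and the 0/1 ground-truth grid; the variable names and the direct-optimization snippet are illustrative.

```python
import torch
import torch.nn.functional as F

def voxel_loss(voxel_logits, voxel_gt):
    # voxel_logits: B x 32 x 32 x 32 raw predictions; voxel_gt: B x 32 x 32 x 32 in {0, 1}
    return F.binary_cross_entropy_with_logits(voxel_logits, voxel_gt.float())

# Directly optimizing a voxel grid to match the ground truth (the Section 1.1 fitting setup):
# voxels = torch.randn(1, 32, 32, 32, requires_grad=True)
# optimizer = torch.optim.Adam([voxels], lr=1e-2)
# loss = voxel_loss(voxels, gt_voxels); loss.backward(); optimizer.step()
```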

Section 1.2: Fitting a Point Cloud

Renders: predicted point cloud vs. ground truth.
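For point clouds the fitting objective is typically the symmetric Chamfer distance between the predicted and ground-truth point sets. A minimal sketch using torch.cdist follows (PyTorch3D's chamfer_distance would work equally well); shapes and names are illustrative.

```python
import torch

def chamfer_loss(pred_points, gt_points):
    # pred_points: B x N x 3, gt_points: B x M x 3
    dists = torch.cdist(pred_points, gt_points)                # B x N x M pairwise distances
    pred_to_gt = dists.min(dim=2).values.pow(2).mean(dim=1)    # each prediction to its nearest GT point
    gt_to_pred = dists.min(dim=1).values.pow(2).mean(dim=1)    # each GT point to its nearest prediction
    return (pred_to_gt + gt_to_pred).mean()
```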

Section 1.3: Fitting a Mesh

Renders: predicted mesh vs. ground truth.
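Meshes are usually fit by deforming a template: points sampled from the predicted surface are matched to the ground truth with Chamfer distance, plus a smoothness regularizer to keep the surface well behaved. A minimal sketch assuming the PyTorch3D Meshes API; the sample count and smoothness weight are illustrative.

```python
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes

def mesh_loss(pred_mesh, gt_mesh, n_samples=5000, w_smooth=0.1):
    # Compare the two surfaces by sampling points from each mesh.
    pred_pts = sample_points_from_meshes(pred_mesh, n_samples)
    gt_pts = sample_points_from_meshes(gt_mesh, n_samples)
    loss_chamfer, _ = chamfer_distance(pred_pts, gt_pts)
    # Laplacian smoothing penalizes jagged, spiky vertex configurations.
    loss_smooth = mesh_laplacian_smoothing(pred_mesh, method="uniform")
    return loss_chamfer + w_smooth * loss_smooth
```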

Section 2: Reconstructing 3D from a Single View

Section 2.1: Image to Voxel Grid

F1@0.05 Score: ~66

Renders (three examples): input image, ground truth, and predicted voxel grid.
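One plausible shape for the voxel decoder is a small MLP that expands the 512-d image feature into 32³ occupancy logits, as sketched below; the layer widths are illustrative rather than taken from the trained model.

```python
import torch.nn as nn

class VoxelDecoder(nn.Module):
    """Maps a B x 512 image feature to a B x 32 x 32 x 32 grid of occupancy logits."""
    def __init__(self, feat_dim=512, res=32):
        super().__init__()
        self.res = res
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 2048), nn.ReLU(),
            nn.Linear(2048, res ** 3),      # one logit per voxel
        )

    def forward(self, feats):
        return self.mlp(feats).view(-1, self.res, self.res, self.res)
```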

Section 2.2: Image to Point Cloud

F1@0.05 Score: ~82 (n_points=1k)

Renders (three examples): input image, ground truth, and predicted point cloud.
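A matching sketch of the point-cloud decoder: an MLP that regresses n_points 3D coordinates (1k here, matching the reported run). The layer widths and the bounded output activation are assumptions, not details of the implementation.

```python
import torch.nn as nn

class PointDecoder(nn.Module):
    """Maps a B x 512 image feature to a B x n_points x 3 point cloud."""
    def __init__(self, feat_dim=512, n_points=1000):
        super().__init__()
        self.n_points = n_points
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 2048), nn.ReLU(),
            nn.Linear(2048, n_points * 3),
            nn.Tanh(),                       # keep coordinates in a bounded range
        )

    def forward(self, feats):
        return self.mlp(feats).view(-1, self.n_points, 3)
```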

Section 2.3: Image to Mesh

F1@0.05 Score: ~72

Renders (three examples): input image, ground truth, and predicted mesh.
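The mesh model predicts a deformation of a fixed template (a sphere, as discussed in Section 2.4 below). Below is a minimal sketch assuming a PyTorch3D ico_sphere whose vertices are offset by an MLP prediction; the sphere level and layer sizes are illustrative.

```python
import torch.nn as nn
from pytorch3d.utils import ico_sphere

class MeshDecoder(nn.Module):
    """Predicts per-vertex offsets that deform a sphere template toward the target shape."""
    def __init__(self, feat_dim=512, sphere_level=4, device="cpu"):
        super().__init__()
        self.template = ico_sphere(sphere_level, device)       # fixed-topology template mesh
        n_verts = self.template.verts_packed().shape[0]
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, n_verts * 3),
        )

    def forward(self, feats):
        batch = feats.shape[0]
        meshes = self.template.extend(batch)                    # replicate the template per example
        offsets = self.mlp(feats).view(-1, 3)                   # packed (batch * n_verts) x 3 offsets
        return meshes.offset_verts(offsets)
```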

Section 2.4: Quantitative Comparison

F1@0.05 Score Table

Representation   F1@0.05
Voxel Grid       ~66
Point Cloud      ~82 (n_points = 1k)
Mesh             ~72

The point cloud model gives the best F1 score because it represents only the surface of the object, not the empty space around it. This lets it capture fine details and shapes much more accurately, without being limited by a fixed grid or template. The voxel model, on the other hand, works with a coarse 32×32×32 grid, most of which is empty. That makes it hard to represent thin parts or small gaps, so the reconstructed shapes look blocky and less precise.

The mesh model does a bit better than voxels because it creates smooth, continuous surfaces, but it’s restricted by its fixed template (like a deformed sphere). It struggles with objects that have holes or separate parts. Overall, the point cloud gives a cleaner and more flexible representation of 3D shapes, which is why it scores higher.
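For reference, the F1@0.05 metric behind these scores treats a predicted point as correct if it lies within 0.05 of some ground-truth point, and symmetrically for recall. A minimal sketch, assuming both shapes are compared as sampled point sets:

```python
import torch

def f1_score(pred_points, gt_points, threshold=0.05):
    # pred_points: N x 3, gt_points: M x 3 (a single example)
    dists = torch.cdist(pred_points, gt_points)                        # N x M pairwise distances
    precision = (dists.min(dim=1).values < threshold).float().mean()   # predictions close to some GT point
    recall = (dists.min(dim=0).values < threshold).float().mean()      # GT points close to some prediction
    return 100.0 * 2 * precision * recall / (precision + recall + 1e-8)
```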

Section 2.5: Analysis of Hyperparameters

Renders (three examples): input image and predicted point clouds with 1k, 3k, and 10k points.

When varying the number of points in the point-cloud decoder (1k, 3k, and 10k), the quality of reconstruction improved noticeably as the number of points increased. With only 1k points, the predicted shapes captured the overall structure but missed finer surface details and had gaps in thin regions. At 3k points, the reconstructions became denser and more complete, balancing accuracy and training stability. At 10k points, the shapes were the most detailed but also slightly noisier and slower to train, since the network had to predict many more coordinates. Overall, increasing the number of points gives higher geometric fidelity up to a point, after which the gain in detail comes at the cost of longer training time and potential overfitting to small surface noise.
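The main things that presumably change across the 1k / 3k / 10k runs are the size of the decoder's output layer and the density at which ground-truth points are sampled for the Chamfer loss. A small sketch under those assumptions, with names chosen for illustration:

```python
import torch.nn as nn
from pytorch3d.ops import sample_points_from_meshes

def build_point_head(feat_dim, n_points):
    # The decoder's final layer has to regress n_points * 3 coordinates.
    return nn.Linear(feat_dim, n_points * 3)

heads = {n: build_point_head(512, n) for n in (1000, 3000, 10000)}

# Ground-truth supervision is sampled at the matching density, e.g.
# gt_points = sample_points_from_meshes(gt_mesh, num_samples=n_points)
```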

F1@0.05 Score Table (n_points = 1k, 3k, 10k)

Section 2.6: Interpret the Model

To better understand how the voxel model represents object geometry, I varied the isovalue used for visualizing the predicted voxel grids. Lower isovalues (like 0.1–0.3) made the reconstructed shapes appear thicker and more filled in, since more voxels were considered occupied. Higher isovalues (0.6–0.9) produced thinner or even incomplete shapes, highlighting only the most confident regions predicted by the model. This variation helped reveal how the network encodes uncertainty in occupancy — lower-confidence areas correspond to softer voxel activations around surface boundaries. Adjusting the isovalue essentially changes the threshold for what the model “believes” is solid, giving insight into how confident and precise its 3D predictions are.
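A minimal sketch of this thresholding, assuming PyTorch3D's cubify is used to turn sigmoid voxel probabilities into a mesh at a chosen isovalue (any marching-cubes style extractor behaves the same way):

```python
import torch
from pytorch3d.ops import cubify

def voxels_to_mesh(voxel_logits, isovalue):
    # voxel_logits: B x 32 x 32 x 32 raw predictions from the voxel decoder
    probs = torch.sigmoid(voxel_logits)
    # Voxels whose probability exceeds the isovalue count as occupied;
    # lower isovalues keep low-confidence voxels and give thicker shapes.
    return cubify(probs, thresh=isovalue)

# meshes = [voxels_to_mesh(pred_logits, iso) for iso in (0.1, 0.3, 0.5, 0.7, 0.9)]
```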

Renders: input image and the predicted voxel grid thresholded at isovalues from 0.1 to 0.9.

Section 3: Exploring Other Architectures / Datasets

Section 3.3: Extended Dataset for Training

When the model trained on a single class (chair) was used for inference on all three object categories, it failed to generalize beyond the training distribution. The network overfitted to the chair class and produced similar chair-like reconstructions regardless of the actual input category. In contrast, the model trained jointly on all three classes showed a clear improvement in both qualitative predictions and quantitative metrics. It captured distinct structural cues from each object type, which is reflected in the higher F1 scores across categories, indicating better class discrimination and more balanced reconstruction performance.
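A small sketch of the multi-class training setup, assuming the per-class datasets are simply concatenated for joint training; the dataset class and the two unnamed categories are placeholders, since only the chair class is named above.

```python
from torch.utils.data import ConcatDataset, DataLoader

# Hypothetical per-class dataset objects; substitute the actual single-view loaders.
class_names = ["chair", "class_b", "class_c"]                        # only "chair" is named in this report
datasets = [SingleViewShapeDataset(cls) for cls in class_names]      # hypothetical dataset class

multi_class_train = ConcatDataset(datasets)                          # joint training over all three classes
loader = DataLoader(multi_class_train, batch_size=32, shuffle=True)
```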

Trained on Single Class:

Renders (three examples): input image, ground truth, and the point cloud predicted by the single-class model.

Trained on Multi-Class:

Renders (three examples): input image, ground truth, and the point cloud predicted by the multi-class model.