16-825 Assignment 2: Learning Single View to 3D

Abhishek Mathur - armathur

1. Exploring Loss Functions

1.1 Fitting a Voxel Grid

Source Voxel Grid
Target Voxel Grid

1.2 Fitting a Point Cloud

Source Point Cloud
Target Point Cloud

1.3 Fitting a Mesh

Source Mesh
Target Mesh

2. Single View to 3D

2.1 Image to Voxel Grid

Example 1:

Input RGB
Predicted Voxel
Ground Truth

Example 2:

Input RGB
Predicted Voxel
Ground Truth

Example 3:

Input RGB
Predicted Voxel
Ground Truth

2.2 Image to Point Cloud

Example 1:

Input RGB
Predicted Point Cloud
Ground Truth

Example 2:

Input RGB
Predicted Point Cloud
Ground Truth

Example 3:

Input RGB
Predicted Point Cloud
Ground Truth

2.3 Image to Mesh

Example 1:

Input RGB
Predicted Mesh
Ground Truth

Example 2:

Input RGB
Predicted Mesh
Ground Truth

Example 3:

Input RGB
Predicted Mesh
Ground Truth

2.4 Quantitative Comparisons

Voxel F1 Score
Point Cloud F1 Score
Mesh F1 Score

Quantitative F1 score comparisons highlight key differences between 3D representations. Point clouds achieve the highest F1 scores, benefiting from flexible spatial arrangement and direct geometric optimization via chamfer loss. Voxel grids perform moderately well, limited by 32³ resolution but offering stable and efficient reconstructions. Meshes show the most variability; while sometimes scoring highly, their F1 scores do not always reflect visual or topological quality due to optimization complexity and metric limitations.
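
To make the metric concrete, below is a minimal sketch of how an F1 score can be computed for point sets at a fixed distance threshold. The 0.05 threshold, the use of torch.cdist, and the function name are illustrative assumptions, not the exact evaluation code used for these plots.

```python
import torch

def point_f1(pred_points, gt_points, threshold=0.05):
    """F1 between two point sets at a match radius `threshold` (assumed value).

    pred_points: (N, 3) predicted points
    gt_points:   (M, 3) ground-truth points
    """
    # (N, M) matrix of pairwise Euclidean distances
    dists = torch.cdist(pred_points, gt_points)
    # Precision: fraction of predicted points within threshold of some GT point
    precision = (dists.min(dim=1).values < threshold).float().mean()
    # Recall: fraction of GT points within threshold of some predicted point
    recall = (dists.min(dim=0).values < threshold).float().mean()
    return 2 * precision * recall / (precision + recall + 1e-8)
```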

2.5 Hyperparameter Analysis

Num Points = 1000

Predicted Point Cloud (1000 points)
Ground Truth (1000 points)

Num Points = 5000

Predicted Point Cloud (5000 points)
Ground Truth (5000 points)

The comparison between 1000 and 5000 points shows how representation density affects reconstruction quality. With 1000 points, the model produces sparser reconstructions that capture the general shape but can miss geometric detail, particularly in regions of high curvature or fine structure. The increased density at 5000 points covers complex surfaces much better and represents features like object boundaries and internal structure more faithfully. The improvement has trade-offs beyond raw computational cost, however: more points require the network to learn more complex spatial relationships, potentially making training harder and demanding larger model capacity, and a naive chamfer loss computation scales quadratically with point count, making evaluation significantly more expensive.

In practice, the choice of point density depends on the application: applications prioritizing speed might accept the reduced detail of fewer points, while those requiring high geometric fidelity justify the overhead of denser representations. My results suggest that 5000 points are a reasonable balance for this dataset, providing clear quality improvements while remaining computationally manageable.
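
As a concrete illustration, the sketch below shows where the num-points hyperparameter enters: the ground-truth supervision is resampled at the chosen density and the decoder's output layer is sized to match. It assumes PyTorch3D's sample_points_from_meshes; the linear decoder head, latent_dim, and the stand-in mesh and feature are hypothetical placeholders, not the actual architecture.

```python
import torch
from pytorch3d.utils import ico_sphere
from pytorch3d.ops import sample_points_from_meshes

n_points = 5000      # the hyperparameter varied in this section (1000 vs 5000)
latent_dim = 512     # assumed encoder feature size

# Stand-ins: real code uses the dataset mesh and the image encoder's output
gt_mesh = ico_sphere(level=4)
image_features = torch.randn(1, latent_dim)

# Ground-truth supervision: sample n_points from the GT mesh surface
gt_points = sample_points_from_meshes(gt_mesh, num_samples=n_points)  # (1, n_points, 3)

# Decoder head sized to match: the final layer emits n_points * 3 values
head = torch.nn.Linear(latent_dim, n_points * 3)
pred_points = head(image_features).reshape(-1, n_points, 3)
```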

2.6 Model Interpretation

Model Understanding

The point cloud reconstruction approach initializes with randomly distributed 3D points in space and progressively learns to organize them into coherent geometric structures through Chamfer distance optimization. This bidirectional loss function computes the nearest neighbor distances between predicted and ground truth point sets in both directions, ensuring comprehensive surface coverage rather than simple point clustering. The optimization process demonstrates a clear evolution from random spatial distribution to structured object geometry.
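
A minimal sketch of this bidirectional loss, assuming a naive pairwise-distance implementation built on torch.cdist (PyTorch3D's knn-based chamfer_distance is a faster drop-in alternative):

```python
import torch

def chamfer_loss(pred, gt):
    """Bidirectional Chamfer distance between batched point clouds.

    pred: (B, N, 3) predicted points
    gt:   (B, M, 3) ground-truth points
    """
    d = torch.cdist(pred, gt) ** 2                 # (B, N, M) squared distances
    loss_pred_to_gt = d.min(dim=2).values.mean()   # each predicted point -> nearest GT
    loss_gt_to_pred = d.min(dim=1).values.mean()   # each GT point -> nearest prediction
    return loss_pred_to_gt + loss_gt_to_pred
```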

The voxel-based methodology operates through spatial discretization, partitioning 3D space into a regular 32×32×32 grid and performing binary classification on each voxel as either occupied or empty. While this approach imposes resolution constraints compared to point cloud representations, it provides guaranteed volumetric completeness and spatial consistency. The network effectively learns a 3D occupancy function using Binary Cross Entropy loss, determining object presence within each discrete spatial unit.
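
The occupancy objective itself is a per-voxel binary cross-entropy. A minimal sketch with placeholder tensors (the shapes follow the 32³ grid above; the random data stands in for real network output and labels):

```python
import torch
import torch.nn.functional as F

# Predicted occupancy logits for a 32x32x32 grid (placeholder decoder output)
logits = torch.randn(1, 32, 32, 32, requires_grad=True)
# Ground-truth binary occupancy grid
target = (torch.rand(1, 32, 32, 32) > 0.5).float()

# Every voxel is an independent occupied/empty classification
loss = F.binary_cross_entropy_with_logits(logits, target)
loss.backward()
```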

Mesh reconstruction presents the most complex optimization challenge among the three approaches. Beginning with an initial icosphere topology, the method deforms vertex positions while preserving mesh connectivity constraints. The requirement to maintain topological consistency prevents arbitrary vertex displacement and necessitates careful optimization strategies. Smoothness regularization ensures neighboring vertices remain spatially coherent, producing realistic surface curvature. These topological constraints significantly increase optimization complexity, resulting in higher variance in reconstruction quality compared to point cloud and voxel approaches.
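
A sketch of that deformation loop, assuming PyTorch3D's mesh utilities; the smoothing weight, learning rate, iteration count, and the stand-in target are assumptions rather than the tuned values used for the results above.

```python
import torch
from pytorch3d.utils import ico_sphere
from pytorch3d.ops import sample_points_from_meshes
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing

# Fixed-topology source mesh; only per-vertex offsets are optimized
src_mesh = ico_sphere(level=4)
deform = torch.zeros_like(src_mesh.verts_packed(), requires_grad=True)
optimizer = torch.optim.Adam([deform], lr=1e-2)

# Stand-in target point cloud (real code samples the ground-truth shape)
gt_points = sample_points_from_meshes(ico_sphere(level=4), num_samples=5000)

for _ in range(200):
    optimizer.zero_grad()
    new_mesh = src_mesh.offset_verts(deform)          # move vertices, keep connectivity
    pred_points = sample_points_from_meshes(new_mesh, num_samples=5000)
    loss_chamfer, _ = chamfer_distance(pred_points, gt_points)
    loss_smooth = mesh_laplacian_smoothing(new_mesh)  # neighboring vertices stay coherent
    loss = loss_chamfer + 0.1 * loss_smooth           # regularization weight is an assumption
    loss.backward()
    optimizer.step()
```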

3. Additional Explorations

3.1 Implicit Network

Example 1:

Input RGB
Predicted Mesh
Ground Truth

The implicit network approach shifts from explicit 3D prediction to learning a continuous occupancy function. Instead of directly generating voxels or meshes, the network predicts occupancy for any 3D coordinate, allowing flexible, resolution-independent reconstructions. Surfaces are extracted using marching cubes, ensuring topological consistency and smoothness.
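
A minimal sketch of the idea: an assumed MLP decoder conditioned on an image feature maps any query coordinate to an occupancy logit, and the surface is extracted from a dense grid of queries with scikit-image's marching_cubes. The architecture, feature size, and grid resolution are all illustrative, and a trained decoder is assumed so the field actually crosses the 0.5 level set.

```python
import torch
from skimage.measure import marching_cubes

class OccupancyDecoder(torch.nn.Module):
    """MLP mapping (image feature, xyz coordinate) -> occupancy logit."""
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(feat_dim + 3, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, 1),
        )

    def forward(self, feat, xyz):
        # feat: (B, feat_dim) image feature, xyz: (B, P, 3) query points
        feat = feat.unsqueeze(1).expand(-1, xyz.shape[1], -1)
        return self.net(torch.cat([feat, xyz], dim=-1)).squeeze(-1)  # (B, P)

decoder = OccupancyDecoder()
feat = torch.randn(1, 512)  # stand-in for the encoder's image feature

# Query occupancy on a dense grid, then extract the 0.5 level set
res = 32
lin = torch.linspace(-1, 1, res)
grid = torch.stack(torch.meshgrid(lin, lin, lin, indexing="ij"), dim=-1).reshape(1, -1, 3)
with torch.no_grad():
    occ = torch.sigmoid(decoder(feat, grid)).reshape(res, res, res).numpy()
verts, faces, normals, values = marching_cubes(occ, level=0.5)
```

Because the network is queried per coordinate, the same trained model can be evaluated at any grid resolution, which is the resolution independence the paragraph above describes.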