Single View to 3D Reconstruction

Assignment 2: Learning 3D Representations from Single RGB Images

1.1 Fitting a Voxel Grid

Fitted a voxel grid representation to a target 3D object by optimizing a randomly initialized occupancy grid with a binary cross-entropy loss.
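The fitting loop itself is small; a minimal sketch (assuming PyTorch, with illustrative names such as fit_voxel_grid) looks like this:

```python
import torch

def fit_voxel_grid(voxels_tgt, iters=1000, lr=1e-2):
    """Fit a randomly initialized voxel grid to a target occupancy grid.

    voxels_tgt: (D, H, W) tensor of {0, 1} target occupancies.
    Raw logits are optimized so BCE-with-logits can be used directly.
    """
    logits = torch.randn_like(voxels_tgt, dtype=torch.float32, requires_grad=True)
    optimizer = torch.optim.Adam([logits], lr=lr)
    bce = torch.nn.BCEWithLogitsLoss()

    for _ in range(iters):
        optimizer.zero_grad()
        loss = bce(logits, voxels_tgt.float())
        loss.backward()
        optimizer.step()

    # Sigmoid maps the fitted logits to occupancy probabilities in [0, 1].
    return torch.sigmoid(logits).detach()
```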

[Figures: ground truth voxel grid | fitted voxel grid]

1.2 Fitting a Point Cloud

Fitted a point cloud to a target mesh by optimizing randomly initialized point positions with the Chamfer distance loss.
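A sketch of this optimization, assuming PyTorch3D's chamfer_distance and illustrative function names:

```python
import torch
from pytorch3d.loss import chamfer_distance

def fit_point_cloud(points_tgt, n_points=5000, iters=1000, lr=1e-2):
    """Optimize randomly initialized point positions toward target surface samples.

    points_tgt: (1, M, 3) points sampled from the target mesh surface.
    """
    points_src = torch.randn(1, n_points, 3, device=points_tgt.device, requires_grad=True)
    optimizer = torch.optim.Adam([points_src], lr=lr)

    for _ in range(iters):
        optimizer.zero_grad()
        loss, _ = chamfer_distance(points_src, points_tgt)  # symmetric Chamfer distance
        loss.backward()
        optimizer.step()

    return points_src.detach()
```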

[Figures: ground truth point cloud | fitted point cloud]

1.3 Fitting a Mesh

Deformed an initial icosphere mesh to match a target mesh using Chamfer loss and smoothness regularization.
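A sketch of the deformation loop, assuming PyTorch3D utilities (ico_sphere, sample_points_from_meshes, mesh_laplacian_smoothing); the smoothness weight shown is illustrative:

```python
import torch
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes
from pytorch3d.utils import ico_sphere

def fit_mesh(mesh_tgt, iters=1000, lr=1e-2, w_smooth=0.1, n_samples=5000):
    """Deform an icosphere toward a target mesh by learning per-vertex offsets."""
    device = mesh_tgt.device
    src_mesh = ico_sphere(level=4, device=device)
    offsets = torch.zeros(src_mesh.verts_packed().shape, device=device, requires_grad=True)
    optimizer = torch.optim.Adam([offsets], lr=lr)

    for _ in range(iters):
        optimizer.zero_grad()
        new_mesh = src_mesh.offset_verts(offsets)
        # Compare surface samples of the current and target meshes.
        pts_src = sample_points_from_meshes(new_mesh, n_samples)
        pts_tgt = sample_points_from_meshes(mesh_tgt, n_samples)
        loss_chamfer, _ = chamfer_distance(pts_src, pts_tgt)
        loss_smooth = mesh_laplacian_smoothing(new_mesh)  # penalizes jagged geometry
        loss = loss_chamfer + w_smooth * loss_smooth
        loss.backward()
        optimizer.step()

    return src_mesh.offset_verts(offsets.detach())
```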

[Figures: ground truth mesh | fitted mesh]

2.1 Image to Voxel Grid

Trained a neural network decoder to predict 32×32×32 binary voxel grids from single RGB images, using a ResNet18 encoder.
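The decoder architecture is not reproduced here; the sketch below shows one plausible layout (a fully connected decoder on top of the 512-d ResNet18 feature) and is illustrative rather than the exact model used:

```python
import torch
import torch.nn as nn
import torchvision

class SingleViewToVoxel(nn.Module):
    """ResNet18 image encoder + decoder predicting a 32^3 occupancy grid (logits)."""

    def __init__(self):
        super().__init__()
        # weights=None for brevity; in practice a pretrained encoder is typically used
        # (argument name varies with the torchvision version).
        resnet = torchvision.models.resnet18(weights=None)
        # Drop the classification head; keep the 512-d global feature.
        self.encoder = nn.Sequential(*list(resnet.children())[:-1])
        self.decoder = nn.Sequential(
            nn.Linear(512, 2048), nn.ReLU(),
            nn.Linear(2048, 32 * 32 * 32),
        )

    def forward(self, images):                       # images: (B, 3, H, W)
        feat = self.encoder(images).flatten(1)       # (B, 512)
        logits = self.decoder(feat)                  # (B, 32768)
        return logits.view(-1, 32, 32, 32)           # occupancy logits

# Training minimizes BCEWithLogitsLoss against the ground-truth voxel grids.
```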

For each example: Input RGB, predicted voxel grid render, and ground truth mesh render.

[Example 1: input RGB | predicted voxel grid | ground truth mesh]
[Example 2: input RGB | predicted voxel grid | ground truth mesh]
[Example 3: input RGB | predicted voxel grid | ground truth mesh]

2.2 Image to Point Cloud

Trained a neural network to predict 3D point clouds directly from single RGB images.
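A sketch of one such architecture, assuming the same ResNet18 encoder as in 2.1 and an MLP head that regresses N×3 coordinates (illustrative, not the exact model):

```python
import torch
import torch.nn as nn
import torchvision

class SingleViewToPointCloud(nn.Module):
    """ResNet18 encoder + MLP decoder that regresses n_points x 3 coordinates."""

    def __init__(self, n_points=5000):
        super().__init__()
        self.n_points = n_points
        resnet = torchvision.models.resnet18(weights=None)
        self.encoder = nn.Sequential(*list(resnet.children())[:-1])
        self.decoder = nn.Sequential(
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, n_points * 3), nn.Tanh(),  # keep points in [-1, 1]^3
        )

    def forward(self, images):                        # (B, 3, H, W)
        feat = self.encoder(images).flatten(1)        # (B, 512)
        points = self.decoder(feat)                   # (B, 3 * n_points)
        return points.view(-1, self.n_points, 3)

# Training minimizes the Chamfer distance between predicted and ground-truth points.
```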

For each example: Input RGB, predicted point cloud render, and ground truth mesh render.

[Example 1: input RGB | predicted point cloud | ground truth mesh]
[Example 2: input RGB | predicted point cloud | ground truth mesh]
[Example 3: input RGB | predicted point cloud | ground truth mesh]

2.3 Image to Mesh

Trained a neural network to deform an initial icosphere mesh into the target shape by predicting per-vertex offsets.
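A sketch of the idea, again assuming a ResNet18 encoder and PyTorch3D's ico_sphere template (names and sizes are illustrative):

```python
import torch
import torch.nn as nn
import torchvision
from pytorch3d.utils import ico_sphere

class SingleViewToMesh(nn.Module):
    """Predict per-vertex offsets that deform a fixed icosphere into the target shape."""

    def __init__(self, ico_level=4):
        super().__init__()
        self.template = ico_sphere(level=ico_level)
        self.n_verts = self.template.verts_packed().shape[0]
        resnet = torchvision.models.resnet18(weights=None)
        self.encoder = nn.Sequential(*list(resnet.children())[:-1])
        self.decoder = nn.Sequential(
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, self.n_verts * 3), nn.Tanh(),   # bounded offsets
        )

    def forward(self, images):
        B = images.shape[0]
        feat = self.encoder(images).flatten(1)                   # (B, 512)
        offsets = self.decoder(feat).view(B * self.n_verts, 3)   # packed per-vertex offsets
        template = self.template.extend(B).to(images.device)
        return template.offset_verts(offsets)                    # deformed Meshes batch

# Loss: Chamfer distance on sampled surface points plus a smoothness regularizer.
```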

For each example: Input RGB, predicted mesh render, and ground truth mesh render.

[Example 1: input RGB | predicted mesh | ground truth mesh]
[Example 2: input RGB | predicted mesh | ground truth mesh]
[Example 3: input RGB | predicted mesh | ground truth mesh]

2.4 Quantitative Comparisons

F1-score comparison across different 3D representations at various distance thresholds.

[Plots: point cloud F1 | mesh F1 | voxel grid F1, each vs. distance threshold]
Representation   Avg. F1@0.05
Voxel Grid       70.00%
Point Cloud      87.43%
Mesh             80.76%
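The F1@0.05 values above are computed from precision and recall on sampled surface points; a generic sketch of the metric (assuming PyTorch3D's knn_points, not necessarily the exact evaluation code used here):

```python
import torch
from pytorch3d.ops import knn_points

def f1_score(points_pred, points_gt, threshold=0.05):
    """F1 between two point sets at a distance threshold. Inputs: (1, N, 3) tensors."""
    # Distance from each predicted point to its nearest ground-truth point, and vice versa.
    d_pred_to_gt = knn_points(points_pred, points_gt, K=1).dists.sqrt().squeeze()
    d_gt_to_pred = knn_points(points_gt, points_pred, K=1).dists.sqrt().squeeze()

    precision = (d_pred_to_gt < threshold).float().mean() * 100.0  # % of predictions near GT
    recall = (d_gt_to_pred < threshold).float().mean() * 100.0     # % of GT covered
    return 2.0 * precision * recall / (precision + recall + 1e-8)
```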

Intuitive Explanation:

Point clouds achieved the best F1 score (87.43%), ahead of meshes (80.76%) and voxel grids (70.00%). Point clouds avoid the discretization artifacts of a fixed-resolution voxel grid and are more flexible than meshes, which are constrained by the topology of the initial icosphere. Because points can be placed freely wherever they are most needed, point clouds are particularly effective at capturing fine geometric detail.

However, the superior visual quality of voxel reconstructions despite their lower quantitative scores suggests that the F1 metric may not fully capture perceptual quality. Voxels produce watertight, coherent surfaces that are visually appealing, while point clouds, though more accurate numerically, can appear sparse or incomplete. The marching cubes algorithm used for voxel visualization creates smooth, continuous surfaces that better match human perception of 3D shapes, even when the underlying voxel grid lacks fine detail. This highlights the discrepancy between quantitative metrics and qualitative assessment in 3D reconstruction.

2.5 Hyperparameter Variations

Experiment: Number of Points in Point Cloud

The number of points used to represent the 3D model affects performance. Varying the point count exposes a trade-off between the richness of the representation and how accurately the decoder can place each point.

Number of Points   F1@0.05
2500               92.91
5000               94.18
7500               91.20

Conclusion: The results suggest that 5000 points works well for this task. Too many points (7500) may exceed the decoder's capacity, while too few (2500) may be insufficient to capture detailed structure. This highlights the balance between representation richness and the model's learning capacity.

2.6 Model Interpretation

To understand what the models have learned, we perform latent space interpolation. This tests whether the models have learned smooth representations or are simply memorizing training examples.

Latent Space Interpolation

We linearly interpolate the encoded features of two different images (a chair and a sofa) with a step size of 0.1, then decode each interpolated feature to generate intermediate 3D structures.
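A sketch of the interpolation procedure (encoder and decoder stand in for the two halves of a trained single-view model; names are illustrative):

```python
import torch

@torch.no_grad()
def interpolate_latents(encoder, decoder, image_a, image_b, steps=11):
    """Decode shapes from linear blends of two image encodings.

    steps=11 gives interpolation weights 0.0, 0.1, ..., 1.0.
    """
    feat_a = encoder(image_a.unsqueeze(0)).flatten(1)   # (1, feat_dim)
    feat_b = encoder(image_b.unsqueeze(0)).flatten(1)

    shapes = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        feat = (1.0 - alpha) * feat_a + alpha * feat_b  # linear blend in feature space
        shapes.append(decoder(feat))                    # intermediate 3D prediction
    return shapes
```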

Observations:

• All three models (voxel, point cloud, and mesh) generate smooth transitions from chair to sofa structures

• This suggests the models learn continuous representations rather than discrete memorization

• The smooth interpolation indicates the latent space captures geometric variations reasonably well

• The encoder-decoder architecture appears to learn a continuous mapping between image space and 3D shape space

Interpretation: The latent space interpolation experiments show that the models learn meaningful representations. The ability to generate plausible intermediate shapes between different object categories suggests the models develop some understanding of 3D structure beyond simple input-output memorization.

3.1 Implicit Network

Implemented an implicit decoder that takes 3D locations as input and outputs occupancy values.
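The decoder is a coordinate MLP; a minimal sketch is shown below (layer sizes are illustrative, and in practice the network may also be conditioned on image features):

```python
import torch
import torch.nn as nn

class ImplicitDecoder(nn.Module):
    """MLP mapping 3D coordinates (B, N, 3) to occupancy logits (B, N)."""

    def __init__(self, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, xyz):                  # (B, N, 3) query locations
        return self.mlp(xyz).squeeze(-1)     # occupancy logits; sigmoid -> probability

# The surface can be extracted by querying a dense grid of locations and
# running marching cubes on the resulting occupancy volume.
```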

[Figures: ground truth mesh | predicted implicit surface]