Assignment 2: Learning 3D Representations from Single RGB Images
Fitted a voxel grid representation to a target 3D object by optimizing a randomly initialized occupancy grid with a binary cross-entropy loss.
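A minimal sketch of this fitting loop in PyTorch, assuming a binary `target_voxels` tensor of shape 32×32×32 (the learning rate and iteration count are assumed values):

```python
import torch

# Sketch: fit voxel occupancy logits to a target grid with BCE.
# `target_voxels` is assumed given: a (32, 32, 32) float tensor of {0, 1}.
voxel_logits = torch.randn(32, 32, 32, requires_grad=True)  # random init
optimizer = torch.optim.Adam([voxel_logits], lr=1e-2)
criterion = torch.nn.BCEWithLogitsLoss()

for step in range(2000):
    optimizer.zero_grad()
    loss = criterion(voxel_logits, target_voxels)
    loss.backward()
    optimizer.step()

# Binarize for visualization (e.g., before marching cubes).
fitted = (torch.sigmoid(voxel_logits) > 0.5).float()
```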
Fitted a point cloud to a target mesh by optimizing randomly initialized point positions with a Chamfer distance loss.
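A sketch of the point cloud fitting loop, assuming PyTorch3D is available and `target_mesh` is given as a `Meshes` object:

```python
import torch
from pytorch3d.loss import chamfer_distance
from pytorch3d.ops import sample_points_from_meshes

# `target_mesh` is assumed to be a pytorch3d.structures.Meshes object.
target_points = sample_points_from_meshes(target_mesh, num_samples=5000)
points = torch.randn(1, 5000, 3, requires_grad=True)  # random init
optimizer = torch.optim.Adam([points], lr=1e-2)

for step in range(2000):
    optimizer.zero_grad()
    loss, _ = chamfer_distance(points, target_points)
    loss.backward()
    optimizer.step()
```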
Deformed an initial icosphere mesh to match a target mesh using Chamfer loss and smoothness regularization.
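A sketch of the mesh fitting loop using PyTorch3D's icosphere template and uniform Laplacian smoothing; the smoothness weight and sphere subdivision level are assumed values:

```python
import torch
from pytorch3d.utils import ico_sphere
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes

src_mesh = ico_sphere(level=4)  # initial sphere template
deform = torch.zeros(src_mesh.verts_packed().shape, requires_grad=True)
optimizer = torch.optim.Adam([deform], lr=1e-3)
target_points = sample_points_from_meshes(target_mesh, num_samples=5000)

for step in range(2000):
    optimizer.zero_grad()
    new_mesh = src_mesh.offset_verts(deform)
    pred_points = sample_points_from_meshes(new_mesh, num_samples=5000)
    loss_chamfer, _ = chamfer_distance(pred_points, target_points)
    loss_smooth = mesh_laplacian_smoothing(new_mesh, method="uniform")
    loss = loss_chamfer + 0.1 * loss_smooth  # 0.1 is an assumed weight
    loss.backward()
    optimizer.step()
```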
Trained a neural network decoder to predict 32×32×32 binary voxel grids from single RGB images using a ResNet18 encoder.
For each example: Input RGB, predicted voxel grid render, and ground truth mesh render.
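A sketch of one plausible encoder-decoder architecture for this stage; the decoder layer sizes are illustrative assumptions, not the exact configuration used:

```python
import torch
import torch.nn as nn
import torchvision

class VoxelDecoder(nn.Module):
    """Image-to-voxel sketch: ResNet18 features to a 32^3 occupancy grid."""
    def __init__(self):
        super().__init__()
        weights = torchvision.models.ResNet18_Weights.DEFAULT
        resnet = torchvision.models.resnet18(weights=weights)
        self.encoder = nn.Sequential(*list(resnet.children())[:-1])  # (B, 512, 1, 1)
        self.decoder = nn.Sequential(
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, 32 * 32 * 32),  # occupancy logits
        )

    def forward(self, images):  # images: (B, 3, H, W)
        feats = self.encoder(images).flatten(1)  # (B, 512)
        return self.decoder(feats).view(-1, 32, 32, 32)
```

Training minimizes `BCEWithLogitsLoss` between the predicted logits and the ground truth occupancy grid.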
Trained a neural network to predict 3D point clouds directly from single RGB images.
For each example: Input RGB, predicted point cloud render, and ground truth mesh render.
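A sketch of a point decoder head that could sit on the same ResNet18 encoder; the layer sizes and the `Tanh` output range are assumptions:

```python
import torch.nn as nn

class PointDecoder(nn.Module):
    """Sketch: maps a 512-d image feature to n_points xyz coordinates."""
    def __init__(self, n_points=5000):
        super().__init__()
        self.n_points = n_points
        self.mlp = nn.Sequential(
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, 2048), nn.ReLU(),
            nn.Linear(2048, n_points * 3), nn.Tanh(),  # coords in [-1, 1]
        )

    def forward(self, feats):  # feats: (B, 512) from the image encoder
        return self.mlp(feats).view(-1, self.n_points, 3)
```

The predicted cloud is trained against points sampled from the ground truth mesh using the Chamfer loss.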
Trained a neural network to deform an initial icosphere mesh into the target shape using vertex offsets.
For each example: Input RGB, predicted mesh render, and ground truth mesh render.
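A sketch of the mesh head, predicting per-vertex offsets for a fixed icosphere template (MLP sizes are illustrative; device handling omitted):

```python
import torch.nn as nn
from pytorch3d.utils import ico_sphere

class MeshDecoder(nn.Module):
    """Sketch: predicts per-vertex offsets for a fixed icosphere template."""
    def __init__(self, level=4):
        super().__init__()
        self.template = ico_sphere(level)
        n_verts = self.template.verts_packed().shape[0]
        self.mlp = nn.Sequential(
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, n_verts * 3), nn.Tanh(),
        )

    def forward(self, feats):  # feats: (B, 512)
        offsets = self.mlp(feats).view(-1, 3)  # packed per-vertex offsets
        meshes = self.template.extend(feats.shape[0])
        return meshes.offset_verts(offsets)
```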
F1-score comparison across the three 3D representations: the score at distance threshold 0.05 and the average over the evaluated thresholds.
| Representation | F1@0.05 (%) | Average (%) |
|---|---|---|
| Voxel Grid | 70.00 | 70.00 |
| Point Cloud | 87.43 | 87.43 |
| Mesh | 80.76 | 80.76 |
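For reference, F1 at a distance threshold can be computed from bidirectional nearest-neighbor distances between predicted and ground truth point sets; a sketch using PyTorch3D's `knn_points` (the assignment's evaluation code may differ):

```python
import torch
from pytorch3d.ops import knn_points

def f1_score(pred_points, gt_points, threshold=0.05):
    """F1@threshold between point sets of shape (1, N, 3), as a percentage."""
    # knn_points returns squared distances; take sqrt for Euclidean.
    d_pred_to_gt = knn_points(pred_points, gt_points, K=1).dists.sqrt()
    d_gt_to_pred = knn_points(gt_points, pred_points, K=1).dists.sqrt()
    precision = (d_pred_to_gt < threshold).float().mean()
    recall = (d_gt_to_pred < threshold).float().mean()
    return (2 * precision * recall / (precision + recall + 1e-8) * 100).item()
```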
Intuitive Explanation:
Point clouds achieved the best F1 score (87.43%) compared to voxel grids (70%) and meshes (80.76%). This is because point clouds provide a continuous representation without the discretization artifacts of voxels, while being more flexible than meshes which are constrained by their initial topology. Point clouds can freely distribute points where needed most, making them particularly effective for capturing fine geometric details.
However, the superior visual quality of voxel reconstructions despite their lower quantitative scores suggests that the F1 metric may not fully capture perceptual quality. Voxels produce watertight, coherent surfaces that are visually appealing, while point clouds, though more accurate numerically, can appear sparse or incomplete. The marching cubes algorithm used for voxel visualization creates smooth, continuous surfaces that better match human perception of 3D shapes, even when the underlying voxel grid lacks fine detail. This highlights the discrepancy between quantitative metrics and qualitative assessment in 3D reconstruction.
The number of points used to represent the 3D model affects performance. Varying the point count reveals a trade-off between representation capacity and the model's ability to learn the mapping.
| Number of Points | F1@0.05 (%) |
|---|---|
| 2500 | 92.91 |
| 5000 | 94.18 |
| 7500 | 91.20 |
Conclusion: The results suggest that 5000 points is a good operating point for this task. Too many points (7500) may exceed the model's capacity, while too few (2500) may be insufficient to capture detailed structure. This reveals an important balance between representation richness and the model's learning capacity.
To understand what the models have learned, we perform latent space interpolation. This tests whether the models have learned smooth representations or are simply memorizing training examples.
We linearly interpolate the encoded features of two different images (a chair and a sofa) with an interpolation step of 0.1, then decode each intermediate feature into a 3D structure, as sketched below.
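A sketch of the procedure, assuming trained `encoder` and `decoder` modules and preprocessed (1, 3, H, W) image tensors `img_chair` and `img_sofa` (names are placeholders):

```python
import torch

with torch.no_grad():
    z_chair = encoder(img_chair).flatten(1)  # (1, 512) latent features
    z_sofa = encoder(img_sofa).flatten(1)
    shapes = []
    for alpha in torch.arange(0.0, 1.01, 0.1):  # 11 interpolation steps
        z = (1 - alpha) * z_chair + alpha * z_sofa
        shapes.append(decoder(z))  # decode the intermediate 3D structure
```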
Observations:
• All three models (voxel, point cloud, and mesh) generate smooth transitions from chair to sofa structures
• This suggests the models learn continuous representations rather than discrete memorization
• The smooth interpolation indicates the latent space captures geometric variations reasonably well
• The encoder-decoder architecture appears to learn a continuous mapping between image space and 3D shape space
Interpretation: The latent space interpolation experiments show that the models learn meaningful representations. The ability to generate plausible intermediate shapes between different object categories suggests the models develop some understanding of 3D structure beyond simple input-output memorization.
Implemented an implicit decoder that takes 3D locations as input and outputs occupancy values.
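A sketch of such a decoder, conditioning each 3D query point on the image feature; hidden sizes are illustrative:

```python
import torch
import torch.nn as nn

class ImplicitDecoder(nn.Module):
    """Sketch: maps (image feature, xyz query) pairs to occupancy logits."""
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # one occupancy logit per query point
        )

    def forward(self, feats, points):  # feats: (B, D), points: (B, N, 3)
        feats = feats.unsqueeze(1).expand(-1, points.shape[1], -1)
        x = torch.cat([feats, points], dim=-1)  # (B, N, D + 3)
        return self.mlp(x).squeeze(-1)  # (B, N) occupancy logits
```

A dense grid of query points can then be evaluated, thresholded, and passed to marching cubes to extract a surface.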