Assignment 2: Learning 3D Representations from Single RGB Images
Fitted a voxel grid representation to a target 3D object by optimizing a randomly initialized occupancy grid with a binary cross-entropy loss.
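A minimal sketch of this fitting loop in PyTorch, assuming a binary `target_voxels` tensor of shape 32×32×32 (the learning rate and iteration count are assumed values):

```python
import torch

# Sketch: fit voxel occupancy logits to a target grid with BCE.
# `target_voxels` is assumed given: a (32, 32, 32) float tensor of {0, 1}.
voxel_logits = torch.randn(32, 32, 32, requires_grad=True)  # random init
optimizer = torch.optim.Adam([voxel_logits], lr=1e-2)
criterion = torch.nn.BCEWithLogitsLoss()

for step in range(2000):
    optimizer.zero_grad()
    loss = criterion(voxel_logits, target_voxels)
    loss.backward()
    optimizer.step()

# Binarize for visualization (e.g., before marching cubes).
fitted = (torch.sigmoid(voxel_logits) > 0.5).float()
```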
Fitted a point cloud to a target mesh by optimizing randomly initialized point positions with a Chamfer distance loss.
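A sketch of the point cloud fitting loop, assuming PyTorch3D is available and `target_mesh` is given as a `Meshes` object:

```python
import torch
from pytorch3d.loss import chamfer_distance
from pytorch3d.ops import sample_points_from_meshes

# `target_mesh` is assumed to be a pytorch3d.structures.Meshes object.
target_points = sample_points_from_meshes(target_mesh, num_samples=5000)
points = torch.randn(1, 5000, 3, requires_grad=True)  # random init
optimizer = torch.optim.Adam([points], lr=1e-2)

for step in range(2000):
    optimizer.zero_grad()
    loss, _ = chamfer_distance(points, target_points)
    loss.backward()
    optimizer.step()
```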
Deformed an initial icosphere mesh to match a target mesh using Chamfer loss and smoothness regularization.
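A sketch of the mesh fitting loop using PyTorch3D's icosphere template and uniform Laplacian smoothing; the smoothness weight and sphere subdivision level are assumed values:

```python
import torch
from pytorch3d.utils import ico_sphere
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes

src_mesh = ico_sphere(level=4)  # initial sphere template
deform = torch.zeros(src_mesh.verts_packed().shape, requires_grad=True)
optimizer = torch.optim.Adam([deform], lr=1e-3)
target_points = sample_points_from_meshes(target_mesh, num_samples=5000)

for step in range(2000):
    optimizer.zero_grad()
    new_mesh = src_mesh.offset_verts(deform)
    pred_points = sample_points_from_meshes(new_mesh, num_samples=5000)
    loss_chamfer, _ = chamfer_distance(pred_points, target_points)
    loss_smooth = mesh_laplacian_smoothing(new_mesh, method="uniform")
    loss = loss_chamfer + 0.1 * loss_smooth  # 0.1 is an assumed weight
    loss.backward()
    optimizer.step()
```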
Trained a neural network decoder to predict 32×32×32 binary voxel grids from single RGB images using a ResNet18 encoder.
For each example: Input RGB, predicted voxel grid render, and ground truth mesh render.
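A sketch of one plausible encoder-decoder architecture for this stage; the decoder layer sizes are illustrative assumptions, not the exact configuration used:

```python
import torch
import torch.nn as nn
import torchvision

class VoxelDecoder(nn.Module):
    """Image-to-voxel sketch: ResNet18 features to a 32^3 occupancy grid."""
    def __init__(self):
        super().__init__()
        weights = torchvision.models.ResNet18_Weights.DEFAULT
        resnet = torchvision.models.resnet18(weights=weights)
        self.encoder = nn.Sequential(*list(resnet.children())[:-1])  # (B, 512, 1, 1)
        self.decoder = nn.Sequential(
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, 32 * 32 * 32),  # occupancy logits
        )

    def forward(self, images):  # images: (B, 3, H, W)
        feats = self.encoder(images).flatten(1)  # (B, 512)
        return self.decoder(feats).view(-1, 32, 32, 32)
```

Training minimizes `BCEWithLogitsLoss` between the predicted logits and the ground truth occupancy grid.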
Trained a neural network to predict 3D point clouds directly from single RGB images.
For each example: Input RGB, predicted point cloud render, and ground truth mesh render.
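A sketch of a point decoder head that could sit on the same ResNet18 encoder; the layer sizes and the `Tanh` output range are assumptions:

```python
import torch.nn as nn

class PointDecoder(nn.Module):
    """Sketch: maps a 512-d image feature to n_points xyz coordinates."""
    def __init__(self, n_points=5000):
        super().__init__()
        self.n_points = n_points
        self.mlp = nn.Sequential(
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, 2048), nn.ReLU(),
            nn.Linear(2048, n_points * 3), nn.Tanh(),  # coords in [-1, 1]
        )

    def forward(self, feats):  # feats: (B, 512) from the image encoder
        return self.mlp(feats).view(-1, self.n_points, 3)
```

The predicted cloud is trained against points sampled from the ground truth mesh using the Chamfer loss.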
Trained a neural network to deform an initial icosphere mesh into the target shape using vertex offsets.
For each example: Input RGB, predicted mesh render, and ground truth mesh render.
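A sketch of the mesh head, predicting per-vertex offsets for a fixed icosphere template (MLP sizes are illustrative; device handling omitted):

```python
import torch.nn as nn
from pytorch3d.utils import ico_sphere

class MeshDecoder(nn.Module):
    """Sketch: predicts per-vertex offsets for a fixed icosphere template."""
    def __init__(self, level=4):
        super().__init__()
        self.template = ico_sphere(level)
        n_verts = self.template.verts_packed().shape[0]
        self.mlp = nn.Sequential(
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, n_verts * 3), nn.Tanh(),
        )

    def forward(self, feats):  # feats: (B, 512)
        offsets = self.mlp(feats).view(-1, 3)  # packed per-vertex offsets
        meshes = self.template.extend(feats.shape[0])
        return meshes.offset_verts(offsets)
```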
F1-score comparison across the three 3D representations: the score at distance threshold 0.05 and the average over the evaluated thresholds.
| Representation | F1@0.05 (%) | Average (%) |
|---|---|---|
| Voxel Grid | 70.00 | 70.00 |
| Point Cloud | 87.43 | 87.43 |
| Mesh | 80.76 | 80.76 |
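For reference, F1 at a distance threshold can be computed from bidirectional nearest-neighbor distances between predicted and ground truth point sets; a sketch using PyTorch3D's `knn_points` (the assignment's evaluation code may differ):

```python
import torch
from pytorch3d.ops import knn_points

def f1_score(pred_points, gt_points, threshold=0.05):
    """F1@threshold between point sets of shape (1, N, 3), as a percentage."""
    # knn_points returns squared distances; take sqrt for Euclidean.
    d_pred_to_gt = knn_points(pred_points, gt_points, K=1).dists.sqrt()
    d_gt_to_pred = knn_points(gt_points, pred_points, K=1).dists.sqrt()
    precision = (d_pred_to_gt < threshold).float().mean()
    recall = (d_gt_to_pred < threshold).float().mean()
    return (2 * precision * recall / (precision + recall + 1e-8) * 100).item()
```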
Intuitive Explanation:
Point clouds achieved the best F1 score (87.43%) compared to voxel grids (70%) and meshes (80.76%). This is because point clouds provide a continuous representation without the discretization artifacts of voxels, while being more flexible than meshes which are constrained by their initial topology. Point clouds can freely distribute points where needed most, making them particularly effective for capturing fine geometric details.
However, the superior visual quality of voxel reconstructions despite their lower quantitative scores suggests that the F1 metric may not fully capture perceptual quality. Voxels produce watertight, coherent surfaces that are visually appealing, while point clouds, though more accurate numerically, can appear sparse or incomplete. The marching cubes algorithm used for voxel visualization creates smooth, continuous surfaces that better match human perception of 3D shapes, even when the underlying voxel grid lacks fine detail. This highlights the discrepancy between quantitative metrics and qualitative assessment in 3D reconstruction.
The number of points used to represent the 3D model affects performance. Varying the point count reveals a trade-off between representation capacity and the model's ability to learn the mapping.
| Number of Points | F1@0.05 (%) |
|---|---|
| 2500 | 92.91 |
| 5000 | 94.18 |
| 7500 | 91.20 |
Conclusion: The results suggest that 5000 points is a good operating point for this task. Too many points (7500) may exceed the model's capacity, while too few (2500) may be insufficient to capture detailed structure. This reveals an important balance between representation richness and the model's learning capacity.
To understand what the models have learned, we perform latent space interpolation. This tests whether the models have learned smooth representations or are simply memorizing training examples.
We linearly interpolate the encoded features of two different images (a chair and a sofa) with an interpolation step of 0.1, then decode each intermediate feature into a 3D structure, as sketched below.
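A sketch of the procedure, assuming trained `encoder` and `decoder` modules and preprocessed (1, 3, H, W) image tensors `img_chair` and `img_sofa` (names are placeholders):

```python
import torch

with torch.no_grad():
    z_chair = encoder(img_chair).flatten(1)  # (1, 512) latent features
    z_sofa = encoder(img_sofa).flatten(1)
    shapes = []
    for alpha in torch.arange(0.0, 1.01, 0.1):  # 11 interpolation steps
        z = (1 - alpha) * z_chair + alpha * z_sofa
        shapes.append(decoder(z))  # decode the intermediate 3D structure
```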
Observations:
• All three models (voxel, point cloud, and mesh) generate smooth transitions from chair to sofa structures
• This suggests the models learn continuous representations rather than discrete memorization
• The smooth interpolation indicates the latent space captures geometric variations reasonably well
• The encoder-decoder architecture appears to learn a continuous mapping between image space and 3D shape space
Interpretation: The latent space interpolation experiments show that the models learn meaningful representations. The ability to generate plausible intermediate shapes between different object categories suggests the models develop some understanding of 3D structure beyond simple input-output memorization.
Implemented an implicit decoder that takes 3D locations as input and outputs occupancy values.
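A sketch of such a decoder, conditioning each 3D query point on the image feature; hidden sizes are illustrative:

```python
import torch
import torch.nn as nn

class ImplicitDecoder(nn.Module):
    """Sketch: maps (image feature, xyz query) pairs to occupancy logits."""
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # one occupancy logit per query point
        )

    def forward(self, feats, points):  # feats: (B, D), points: (B, N, 3)
        feats = feats.unsqueeze(1).expand(-1, points.shape[1], -1)
        x = torch.cat([feats, points], dim=-1)  # (B, N, D + 3)
        return self.mlp(x).squeeze(-1)  # (B, N) occupancy logits
```

A dense grid of query points can then be evaluated, thresholded, and passed to marching cubes to extract a surface.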