Homework 2

By Zhewen Zheng (zhewenz)

1. Exploring loss functions

1.1 Fitting a voxel grid

src	target

1.2 Fitting a point cloud

src	target

1.3 Fitting a mesh

src	target

2. Reconstructing 3D from single view

2.1 Image to voxel grid

img	prediction	gt

2.2 Image to point cloud

img	prediction	gt

2.3 Image to mesh

img	prediction	gt

2.4 Quantitative comparisons

Representation	Avg F1@0.05
Voxel	74.809
Point cloud	79.738
Mesh	75.308
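The evaluation code behind these numbers is not shown here. As a reference point, the sketch below computes F1 at a distance threshold under the commonly used definition: precision is the fraction of predicted points within the threshold of some ground-truth point, recall is the reverse, and F1 is their harmonic mean, reported in percent. The function name `f1_score` and the toy points are hypothetical, and the brute-force nearest-neighbour search is only practical for small point sets.

```python
import math

def f1_score(pred_points, gt_points, threshold=0.05):
    """F1 (in percent) between two point sets at a distance threshold."""
    def fraction_within(src, dst):
        # Fraction of points in src whose nearest neighbour in dst
        # is closer than the threshold.
        hits = sum(1 for p in src if min(math.dist(p, q) for q in dst) < threshold)
        return hits / len(src)

    precision = fraction_within(pred_points, gt_points)  # predicted points near the GT
    recall = fraction_within(gt_points, pred_points)     # GT points covered by the prediction
    if precision + recall == 0:
        return 0.0
    return 100.0 * 2 * precision * recall / (precision + recall)

# Toy example: one predicted point lands on the shape, one floats in empty space.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pred = [(0.01, 0.0, 0.0), (0.5, 0.0, 0.0)]
score = f1_score(pred, gt)
```

In the toy example, half of the predicted points are close to the ground truth and half of the ground-truth points are covered, so both precision and recall are 0.5.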

Despite decent F1 scores, the voxel reconstructions show noticeable discontinuities around thin structures. This is likely a consequence of limited grid resolution combined with imprecisely learned occupancies: cells in thin regions whose predicted occupancy falls below the marching cubes isovalue threshold are dropped from the extracted surface.
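The thresholding effect can be illustrated in one dimension. The sketch below is not marching cubes itself; it only applies the isovalue cut to a row of made-up occupancy probabilities along a thin structure and reports which runs of cells survive. The helper `occupied_runs` and the probability values are hypothetical.

```python
def occupied_runs(probabilities, isovalue=0.5):
    """Return (start, end) index ranges of consecutive cells kept at the isovalue."""
    runs, start = [], None
    for i, p in enumerate(probabilities):
        if p > isovalue and start is None:
            start = i                      # a new occupied run begins
        elif p <= isovalue and start is not None:
            runs.append((start, i - 1))    # the run ends at the previous cell
            start = None
    if start is not None:
        runs.append((start, len(probabilities) - 1))
    return runs

# Hypothetical occupancy probabilities along one thin structure (e.g. a chair leg).
leg = [0.9, 0.8, 0.45, 0.7, 0.4, 0.85]
runs_default = occupied_runs(leg, isovalue=0.5)  # the leg breaks into three pieces
runs_lower = occupied_runs(leg, isovalue=0.3)    # the leg stays connected
```

Lowering the isovalue closes the gaps in this example, but on a real prediction it would also admit low-confidence cells elsewhere, so it is a trade-off rather than a fix.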

The point cloud representation performs best both quantitatively and visually. Its lack of explicit connectivity grants it greater flexibility, allowing the model to better capture fine geometric variations and align closely with the ground truth shapes.

The mesh representation achieves slightly higher F1 scores than voxels and produces more continuous surfaces. However, its fixed initial topology (e.g., an icosphere) constrains deformation, making it difficult to accurately model complex or topologically distinct shapes, such as those containing holes or thin appendages.
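The topological constraint can be made concrete with the Euler characteristic, V - E + F, which depends only on the face list. Assuming the model predicts per-vertex offsets and leaves the faces untouched, this quantity cannot change during deformation. The sketch below uses a tetrahedron as a small stand-in for the icosphere (both are closed surfaces with sphere topology); `euler_characteristic` is a hypothetical helper.

```python
def euler_characteristic(faces):
    """V - E + F for a triangle mesh given as vertex-index triples."""
    vertices = {v for face in faces for v in face}
    # Each undirected edge is stored once, whichever faces share it.
    edges = {frozenset((face[i], face[(i + 1) % 3])) for face in faces for i in range(3)}
    return len(vertices) - len(edges) + len(faces)

# Tetrahedron faces; vertex positions are deliberately absent, because
# moving vertices does not affect the result.
faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
chi = euler_characteristic(faces)
genus = (2 - chi) // 2  # valid for closed, orientable surfaces
```

A sphere-like mesh has genus 0, while a shape with one through-hole (for example, an open chair back) has genus 1, so no choice of vertex offsets alone can reach it.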

In summary, each representation presents a trade-off between geometric fidelity and structural constraints: voxels offer regularity but suffer from resolution limits, meshes provide surface continuity but are topologically rigid, while point clouds balance simplicity and adaptability, yielding the most faithful reconstructions overall.

2.5 Analyse effects of hyperparameter variations

img	n_points = 1000, Avg F1@0.05: 79.738	n_points = 2000, Avg F1@0.05: 84.403	n_points = 8000, Avg F1@0.05: 88.409	gt

Models trained with larger n_points consistently achieved higher F1 scores. Notably, in the third row, the model with n_points = 1000 predicts a similar but incorrect chair shape (with armrests that should not exist), and the shape is reconstructed correctly as n_points increases. However, this improvement comes with a trade-off: higher point densities also introduce clutter, as seen in the second-row examples, where points begin to appear in regions that should remain empty.

2.6 Interpret your model


To better understand what visual cues the model relies on, I visualized the saliency map for some entries. The saliency map highlights the regions of the input that most strongly influence the model’s prediction, effectively showing where the network looks when reconstructing the 3D shape.
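The exact saliency method is not stated here. A common choice is the magnitude of the gradient of a scalar model output with respect to each input pixel, usually obtained with automatic differentiation. The sketch below approximates that gradient with central finite differences in pure Python, so it runs without any deep learning framework. Both `saliency_map` and `toy_model` are hypothetical; the toy model is built to respond to horizontal intensity changes and is not the reconstruction network.

```python
def saliency_map(model, image, eps=1e-3):
    """Approximate |d model / d pixel| for every pixel with central differences."""
    height, width = len(image), len(image[0])
    saliency = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            original = image[r][c]
            image[r][c] = original + eps
            up = model(image)
            image[r][c] = original - eps
            down = model(image)
            image[r][c] = original  # restore the pixel before moving on
            saliency[r][c] = abs(up - down) / (2 * eps)
    return saliency

def toy_model(image):
    # Hypothetical scalar output: sum of squared horizontal differences,
    # so it only responds where neighbouring pixels differ (an edge).
    return sum((row[c + 1] - row[c]) ** 2 for row in image for c in range(len(row) - 1))

# A 3x4 image with a vertical edge between columns 1 and 2.
image = [[0.0, 0.0, 1.0, 1.0] for _ in range(3)]
saliency = saliency_map(toy_model, image)
```

For this toy input, the saliency is non-zero only in the two columns adjacent to the edge and zero in the flat regions.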

Interestingly, the highlighted regions tend to align with object edges and boundaries, suggesting that the model focuses on high-frequency features such as silhouettes and sharp transitions, which are typical cues for shape understanding. However, we also observe some attention in empty or background regions, likely due to the receptive field and global averaging behavior of the ResNet backbone, which aggregates spatial context beyond local object boundaries.

Overall, the visualization provides evidence that the model captures boundary information but also exhibits diffuse attention, hinting at opportunities for better spatial localization in future designs.