16-825 Assignment 2: Single View to 3D¶
1. Exploring loss functions¶
1.1. Fitting a voxel grid (5 points)¶
%run fit_data.py --type 'vox'
| Source Voxel | Target Voxel |
|---|---|
| ![]() | ![]() |
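The standard choice for this fit is a per-voxel binary cross-entropy loss between the source logits and the target occupancy. A minimal sketch of that loss and fitting loop (grid size, shapes, and learning rate are my assumptions, not the exact `fit_data.py` code):

```python
import torch
import torch.nn.functional as F

def voxel_loss(pred_logits: torch.Tensor, target_occ: torch.Tensor) -> torch.Tensor:
    """Per-voxel binary cross-entropy between predicted occupancy logits and a
    binary target grid, both of shape (B, D, H, W)."""
    return F.binary_cross_entropy_with_logits(pred_logits, target_occ.float())

# Fitting loop: optimize a free tensor of logits toward the target occupancy grid.
target = (torch.rand(1, 32, 32, 32) > 0.7).float()  # stand-in target occupancy
src_logits = torch.zeros(1, 32, 32, 32, requires_grad=True)
opt = torch.optim.Adam([src_logits], lr=1e-1)
for step in range(200):
    opt.zero_grad()
    voxel_loss(src_logits, target).backward()
    opt.step()
```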
1.2. Fitting a point cloud (5 points)¶
%run fit_data.py --type 'point'
| Source Cloud | Target Cloud |
|---|---|
| ![]() | ![]() |
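A Chamfer-style distance is the standard loss for this fit. Below is a naive O(N·M) sketch using `torch.cdist` (the assignment code may use `pytorch3d.ops.knn_points` instead; cloud sizes and learning rate are my assumptions):

```python
import torch

def chamfer_loss(src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer distance between point clouds of shape (B, N, 3) and
    (B, M, 3), using squared Euclidean nearest-neighbor distances."""
    d = torch.cdist(src, tgt) ** 2         # (B, N, M) pairwise squared distances
    loss_src = d.min(dim=2).values.mean()  # each source point -> nearest target point
    loss_tgt = d.min(dim=1).values.mean()  # each target point -> nearest source point
    return loss_src + loss_tgt

# Fitting loop: move a random cloud onto the target cloud.
tgt = torch.rand(1, 1000, 3)
src = torch.rand(1, 1000, 3, requires_grad=True)
opt = torch.optim.Adam([src], lr=5e-2)
for step in range(300):
    opt.zero_grad()
    chamfer_loss(src, tgt).backward()
    opt.step()
```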
1.3. Fitting a mesh (5 points)¶
%run fit_data.py --type 'mesh'
| Source Mesh | Target Mesh |
|---|---|
| ![]() | ![]() |
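Mesh fitting typically deforms a template (e.g. an icosphere) toward the target using a Chamfer loss on sampled surface points plus a smoothness regularizer. A sketch with PyTorch3D (the target sphere, sample counts, and the 0.1 smoothing weight are stand-ins, not the exact `fit_data.py` setup):

```python
import torch
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes
from pytorch3d.utils import ico_sphere

# Deform a template sphere toward the target by learning per-vertex offsets.
src_mesh = ico_sphere(4)
target_mesh = ico_sphere(4).scale_verts(1.5)  # stand-in target; use the real GT mesh here
deform = torch.zeros(src_mesh.verts_packed().shape, requires_grad=True)
opt = torch.optim.Adam([deform], lr=1e-2)

for step in range(500):
    opt.zero_grad()
    new_mesh = src_mesh.offset_verts(deform)
    pts_src = sample_points_from_meshes(new_mesh, num_samples=2000)
    pts_tgt = sample_points_from_meshes(target_mesh, num_samples=2000)
    loss_cd, _ = chamfer_distance(pts_src, pts_tgt)
    loss_smooth = mesh_laplacian_smoothing(new_mesh, method="uniform")
    (loss_cd + 0.1 * loss_smooth).backward()  # smoothing weight is an arbitrary choice
    opt.step()
```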
2. Reconstructing 3D from single view¶
2.1. Image to voxel grid (20 points)¶
%run eval_model.py --type 'vox' --load_checkpoint
| Input Image | Ground Truth | Model Prediction |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
2.2. Image to point cloud (20 points)¶
%run eval_model.py --type 'point' --load_checkpoint
| Input Image | Ground Truth | Model Prediction |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
2.3. Image to mesh (20 points)¶
%run eval_model.py --type 'mesh' --load_checkpoint
| Input Image | Ground Truth | Model Prediction |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
2.4. Quantitative comparisons (10 points)¶
Voxels¶
The image-to-voxel model achieves an average F1 score of 80 after 30k iterations. As seen in the output images, it captures the global 3D structure well but struggles to capture fine detail, since the memory cost of a voxel grid limits the resolution it can use.
Point Clouds¶
The image-to-point-cloud model achieves an average F1 score of 83 after only 5k iterations. It models the overall shape of the object efficiently, but as an unordered set of points it cannot represent continuous surfaces.
Meshes¶
The image-to-mesh model achieves an average F1 score of 78 after 35k iterations. It models the overall geometry well, but tends to add artifacts that are not present in the ground-truth shape.
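For context, the F1 scores above compare points sampled from the prediction and the ground truth at a fixed distance threshold. A minimal sketch of that metric (the 0.05 threshold and the 0-to-100 scaling follow the usual convention for this metric; this is not necessarily the exact evaluation code):

```python
import torch

def f1_score(pred_pts: torch.Tensor, gt_pts: torch.Tensor, thresh: float = 0.05) -> float:
    """F1 between point clouds of shape (N, 3) and (M, 3), scaled to 0-100.
    Precision: fraction of predicted points within `thresh` of some GT point.
    Recall:    fraction of GT points within `thresh` of some predicted point."""
    d = torch.cdist(pred_pts, gt_pts)  # (N, M) pairwise distances
    precision = (d.min(dim=1).values < thresh).float().mean()
    recall = (d.min(dim=0).values < thresh).float().mean()
    return (2 * precision * recall / (precision + recall + 1e-8) * 100).item()
```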
2.5. Analyse effects of hyperparameter variations (10 points)¶
%run eval_model.py --type 'point' --hyperparams_variation
Analysis of Sampling Density (n_points)¶
Increasing the number of sampled points in the point cloud model from 1,000 to 2,500 raised the average F1 score from 83 to 90. The improvement is primarily due to better surface coverage and more stable nearest-neighbor matching, which reduces the number of false negatives.
Specifically, denser sampling lets the reconstructed point cloud fill gaps that sparse sampling misses, directly improving recall. A denser prediction also makes the nearest-neighbor match for each ground-truth (GT) point more reliable, since every GT point is much more likely to lie close to some predicted point. This result indicates that, for point cloud representations, sampling density is a primary driver of geometric accuracy and of the overall F1 metric.
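A quick way to see this effect in isolation is to compare how far each GT point is from its nearest predicted point at the two densities; a sketch using a stand-in sphere rather than the actual chair predictions:

```python
import torch
from pytorch3d.ops import sample_points_from_meshes
from pytorch3d.utils import ico_sphere

gt = sample_points_from_meshes(ico_sphere(4), num_samples=5000)[0]           # dense "GT" surface
pred_sparse = sample_points_from_meshes(ico_sphere(4), num_samples=1000)[0]  # n_points = 1000
pred_dense = sample_points_from_meshes(ico_sphere(4), num_samples=2500)[0]   # n_points = 2500

# Mean distance from each GT point to its nearest predicted point: the denser
# prediction leaves smaller gaps, i.e. fewer GT points fall outside the F1 threshold.
for name, pred in [("1000 pts", pred_sparse), ("2500 pts", pred_dense)]:
    nn_dist = torch.cdist(gt, pred).min(dim=1).values
    print(name, float(nn_dist.mean()))
```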
| Input Image | Ground Truth | Model Prediction n_pts = 2500 | Model Prediction n_pts = 1000 |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() |
2.6. Interpret your model (15 points)¶
%run eval_model.py --type 'vox' --load_checkpoint --my_viz
%run eval_model.py --type 'point' --load_checkpoint --my_viz
%run eval_model.py --type 'mesh' --load_checkpoint --my_viz
I have compiled a few GIFs to explain how the three models (Voxels, Point Clouds, and Meshes) model 3D structures differently. These GIFs illustrate the reconstruction process by starting from a point in space, moving towards the reconstructed object, and then passing through it. This visualization technique effectively demonstrates how each representation handles geometry, particularly in the presence of holes and gaps within the Ground Truth (GT) 3D object.
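Below is a sketch of how such a fly-through can be rendered with PyTorch3D, stepping the camera distance down until the viewpoint passes into the object (the icosphere stand-in, image size, and camera trajectory are my assumptions, not the exact `--my_viz` code):

```python
import imageio
import torch
from pytorch3d.renderer import (
    FoVPerspectiveCameras, HardPhongShader, MeshRasterizer, MeshRenderer,
    PointLights, RasterizationSettings, TexturesVertex, look_at_view_transform,
)
from pytorch3d.structures import Meshes
from pytorch3d.utils import ico_sphere

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
src = ico_sphere(4, device)  # stand-in for a predicted / GT mesh
mesh = Meshes(
    verts=src.verts_padded(),
    faces=src.faces_padded(),
    textures=TexturesVertex(verts_features=torch.ones_like(src.verts_padded())),
)
lights = PointLights(device=device, location=[[0.0, 0.0, -3.0]])
raster_settings = RasterizationSettings(image_size=256)

frames = []
for dist in torch.linspace(3.0, 0.2, 30):  # step the camera toward and into the object
    R, T = look_at_view_transform(dist=float(dist), elev=10.0, azim=30.0)
    cameras = FoVPerspectiveCameras(R=R, T=T, znear=0.05, device=device)
    renderer = MeshRenderer(
        rasterizer=MeshRasterizer(cameras=cameras, raster_settings=raster_settings),
        shader=HardPhongShader(device=device, cameras=cameras, lights=lights),
    )
    image = renderer(mesh)[0, ..., :3].clamp(0, 1)  # (H, W, 3) RGB in [0, 1]
    frames.append((image.cpu().numpy() * 255).astype("uint8"))

imageio.mimsave("flythrough.gif", frames, duration=0.1)
```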
Voxel Model Analysis¶
The Voxel model, while effective for capturing overall volumetric occupancy, struggles significantly with representing fine details, especially gaps and holes. As demonstrated in the examples below, the reconstructed voxel grids often fill in or blur these negative spaces, failing to precisely capture the object's geometry. This limitation stems from its fundamental nature as a discretized grid: voxels represent the presence or absence of material within fixed-size cubic cells. Modeling intricate holes would require an impractically high resolution (small voxel size), leading to excessive memory usage and computational cost. Thus, at typical resolutions, the voxel model prioritizes solid volume over detailed negative space.
| Input Image | Ground Truth (Mesh) | Model Prediction (Mesh) |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
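The voxel predictions in this table are shown as meshes; one common way to obtain such a mesh from a predicted occupancy grid is PyTorch3D's `cubify` (a sketch; the 0.5 threshold and grid size are my assumptions):

```python
import torch
from pytorch3d.ops import cubify

voxels = torch.rand(1, 32, 32, 32)  # stand-in for predicted occupancy probabilities
mesh = cubify(voxels, thresh=0.5)   # Meshes with one cube per cell above the threshold
```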
Point Cloud Model Analysis¶
The Point Cloud model demonstrates a superior ability to represent gaps and holes, as evidenced by the examples. For instance, in the chair's arms, the model accurately predicts and preserves the gap, unlike the voxel and mesh representations. Even where the overall reconstruction might not be flawless, point clouds provide a much closer approximation to the GT geometry and significantly outperform voxels in capturing complex shapes. This strength lies in the discrete, unordered nature of point clouds; they represent surfaces as a collection of individual points, naturally allowing for empty space between them without requiring a high-resolution grid. This makes them inherently more flexible for modeling non-manifold or disconnected surfaces and intricate negative spaces.
| Input Image | Ground Truth (Mesh) | Model Prediction (Mesh) |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
Mesh Model Analysis¶
The Mesh model generally predicts the overall geometry of shapes well, offering a smooth and continuous surface reconstruction that often surpasses voxels. However, it too typically struggles with accurately representing gaps and holes. While capturing the broad structure of an object, meshes, being composed of connected vertices and faces, tend to either "seal off" or create erroneous connections across internal voids. This limitation arises because traditional mesh generation often prioritizes a manifold, closed surface. Accurately modeling holes requires careful topological manipulation and a sufficiently dense vertex distribution around the void, which can be challenging for a neural network to infer perfectly from a 2D image without introducing artifacts or closing off intended gaps.
| Input Image | Ground Truth (Mesh) | Model Prediction (Mesh) |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
3. Exploring other architectures / datasets. (Choose at least one! More than one is extra credit)¶
3.2 Parametric network (10 points)¶
I modeled an MLP that takes sampled 2D points (x, y) as input, learns the parametric function z = sin(πx) · cos(πy), and outputs the corresponding point (x, y, z) in 3D space.
Training for 10,000 epochs on 2,000 sampled points (see the log below) ensured full convergence, yielding a very low mean squared error (MSE).
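A minimal sketch of such a network and its training loop (layer sizes, sampling range, and learning rate are my assumptions, not necessarily what Parametric_Network.py uses):

```python
import math

import torch
import torch.nn as nn

class ParametricMLP(nn.Module):
    """Maps a 2D point (x, y) to the 3D point (x, y, sin(pi*x) * cos(pi*y))."""

    def __init__(self, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, xy: torch.Tensor) -> torch.Tensor:
        return self.net(xy)

# Training sketch: sample 2D points in [-1, 1]^2 and regress the analytic surface.
xy = torch.rand(2000, 2) * 2 - 1
z = torch.sin(math.pi * xy[:, :1]) * torch.cos(math.pi * xy[:, 1:])
target = torch.cat([xy, z], dim=1)

model = ParametricMLP()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(10000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(xy), target)
    loss.backward()
    opt.step()
```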
%run Parametric_Network.py
```
Input data shape: torch.Size([2000, 2])
Target data shape: torch.Size([2000, 3])
Training network...
Epoch 0, Loss: 0.308620
Epoch 100, Loss: 0.006926
Epoch 200, Loss: 0.001367
Epoch 300, Loss: 0.000403
Epoch 400, Loss: 0.000197
Epoch 500, Loss: 0.000115
Epoch 600, Loss: 0.000082
Epoch 700, Loss: 0.000140
Epoch 800, Loss: 0.000059
Epoch 900, Loss: 0.000049
Epoch 1000, Loss: 0.000046
Epoch 1100, Loss: 0.000153
Epoch 1200, Loss: 0.000036
Epoch 1300, Loss: 0.000034
Epoch 1400, Loss: 0.000033
Epoch 1500, Loss: 0.000352
Epoch 1600, Loss: 0.000028
Epoch 1700, Loss: 0.000028
Epoch 1800, Loss: 0.000040
Epoch 1900, Loss: 0.000024
Epoch 2000, Loss: 0.000031
Epoch 2100, Loss: 0.000022
Epoch 2200, Loss: 0.000021
Epoch 2300, Loss: 0.000220
Epoch 2400, Loss: 0.000019
Epoch 2500, Loss: 0.000042
Epoch 2600, Loss: 0.000018
Epoch 2700, Loss: 0.000017
Epoch 2800, Loss: 0.000026
Epoch 2900, Loss: 0.000020
Epoch 3000, Loss: 0.000016
Epoch 3100, Loss: 0.000342
Epoch 3200, Loss: 0.000014
Epoch 3300, Loss: 0.000028
Epoch 3400, Loss: 0.000013
Epoch 3500, Loss: 0.000017
Epoch 3600, Loss: 0.000012
Epoch 3700, Loss: 0.000013
Epoch 3800, Loss: 0.000022
Epoch 3900, Loss: 0.000011
Epoch 4000, Loss: 0.000011
Epoch 4100, Loss: 0.000011
Epoch 4200, Loss: 0.000068
Epoch 4300, Loss: 0.000016
Epoch 4400, Loss: 0.000011
Epoch 4500, Loss: 0.000010
Epoch 4600, Loss: 0.000012
Epoch 4700, Loss: 0.000010
Epoch 4800, Loss: 0.000009
Epoch 4900, Loss: 0.000015
Epoch 5000, Loss: 0.000009
Epoch 5100, Loss: 0.000022
Epoch 5200, Loss: 0.000008
Epoch 5300, Loss: 0.000009
Epoch 5400, Loss: 0.000190
Epoch 5500, Loss: 0.000008
Epoch 5600, Loss: 0.000009
Epoch 5700, Loss: 0.000007
Epoch 5800, Loss: 0.000008
Epoch 5900, Loss: 0.000019
Epoch 6000, Loss: 0.000007
Epoch 6100, Loss: 0.000115
Epoch 6200, Loss: 0.000007
Epoch 6300, Loss: 0.000563
Epoch 6400, Loss: 0.000007
Epoch 6500, Loss: 0.000007
Epoch 6600, Loss: 0.000007
Epoch 6700, Loss: 0.000312
Epoch 6800, Loss: 0.000007
Epoch 6900, Loss: 0.000044
Epoch 7000, Loss: 0.000006
Epoch 7100, Loss: 0.000006
Epoch 7200, Loss: 0.000007
Epoch 7300, Loss: 0.000006
Epoch 7400, Loss: 0.000009
Epoch 7500, Loss: 0.000024
Epoch 7600, Loss: 0.000006
Epoch 7700, Loss: 0.000008
Epoch 7800, Loss: 0.000013
Epoch 7900, Loss: 0.000129
Epoch 8000, Loss: 0.000006
Epoch 8100, Loss: 0.000012
Epoch 8200, Loss: 0.000006
Epoch 8300, Loss: 0.000006
Epoch 8400, Loss: 0.000006
Epoch 8500, Loss: 0.000007
Epoch 8600, Loss: 0.000006
Epoch 8700, Loss: 0.000006
Epoch 8800, Loss: 0.000013
Epoch 8900, Loss: 0.000009
Epoch 9000, Loss: 0.000008
Epoch 9100, Loss: 0.000005
Epoch 9200, Loss: 0.000052
Epoch 9300, Loss: 0.000005
Epoch 9400, Loss: 0.000162
Epoch 9500, Loss: 0.000005
Epoch 9600, Loss: 0.000095
Epoch 9700, Loss: 0.000006
Epoch 9800, Loss: 0.000130
Epoch 9900, Loss: 0.000005
Training complete.
Test Loss (MSE) on sample points: 0.000003
Sample predictions vs. Ground Truth:
----------------------------------------
2D Input: [0.1 0.2]
3D Ground Truth: [0.1 0.2 0.25]
3D Predicted: [0.10195343 0.2014614 0.25354284]
MSE Loss: 0.000006
----------------------------------------
2D Input: [-0.3 0.4]
3D Ground Truth: [-0.3 0.4 -0.25]
3D Predicted: [-0.29882124 0.40162012 -0.2510722 ]
MSE Loss: 0.000002
----------------------------------------
2D Input: [ 0.5 -0.6]
3D Ground Truth: [ 0.5 -0.6 -0.30901715]
3D Predicted: [ 0.49976665 -0.60047054 -0.30707666]
MSE Loss: 0.000001
```