16-825 Assignment 2: Single View to 3D

Andrew ID: abhinavm

Discussed ideas with: Karthik Pullalarevu (kpullala), Hashil Bhatia (harshilb)

0. Setup

I used AWS and spun up an EC2 g4dn.xlarge instance with the Deep Learning AMI GPU PyTorch 2.1.0 (Ubuntu 20.04) 20231103 image. A few setup tips if you're using something similar: since this image ships CUDA 12.1, there are compatibility issues between CUDA, torch, and PyTorch3D, so install torch 2.5.1 and the matching torchvision (builds compatible with CUDA 12.1) and, very importantly, use the following command to install PyTorch3D:

pip install --extra-index-url https://miropsota.github.io/torch_packages_builder pytorch3d==0.7.8+pt2.5.1cu121
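After installation, a quick sanity check (a minimal sketch; the expected values in the comments are just what this setup should report) confirms the versions line up before training:

import torch
import pytorch3d

print("torch:", torch.__version__)            # expect 2.5.1
print("cuda:", torch.version.cuda)            # expect 12.1
print("cuda available:", torch.cuda.is_available())
print("pytorch3d:", pytorch3d.__version__)    # expect 0.7.8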

1. Exploring loss functions

1.1. Fitting a voxel grid (5 points)

Ground Truth Optimized
First Image Second Image

python train.py --type vox --batch_size 32
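For reference, a minimal sketch of a voxel fitting objective, assuming the decoder outputs raw logits over a B x 32 x 32 x 32 grid and the targets are 0/1 occupancies (names are illustrative, not the exact code used here):

import torch.nn.functional as F

def voxel_loss(pred_logits, gt_voxels):
    # Per-voxel binary cross-entropy, averaged over the batch and the grid.
    return F.binary_cross_entropy_with_logits(pred_logits, gt_voxels.float())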

1.2 Fitting a point cloud (10 points)

Ground Truth Optimized
First Image Second Image

python train.py --type point --batch_size 32
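A minimal sketch of a (squared) chamfer-style fitting loss for two point clouds of shape B x N x 3, using pytorch3d's nearest-neighbour search (illustrative, not the exact code used here):

from pytorch3d.ops import knn_points

def chamfer_loss(src, tgt):
    # Squared distance from each source point to its nearest target point, and vice versa.
    d_src = knn_points(src, tgt, K=1).dists[..., 0]
    d_tgt = knn_points(tgt, src, K=1).dists[..., 0]
    return d_src.mean() + d_tgt.mean()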

1.3 Fitting a mesh (5 points)

Ground Truth Optimized
First Image Second Image

python train.py --type mesh --batch_size 32
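A minimal sketch of a mesh fitting objective: chamfer distance on points sampled from both meshes plus a Laplacian smoothness regulariser (the weight and sample count are illustrative):

from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes

def mesh_fit_loss(pred_mesh, gt_mesh, w_smooth=0.1, n_samples=5000):
    pred_pts = sample_points_from_meshes(pred_mesh, n_samples)
    gt_pts = sample_points_from_meshes(gt_mesh, n_samples)
    chamfer, _ = chamfer_distance(pred_pts, gt_pts)
    return chamfer + w_smooth * mesh_laplacian_smoothing(pred_mesh)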

2. Reconstructing 3D from single view

2.1. Image to voxel grid (15 points)

Input RGB Ground Truth Mesh Ground Truth Voxel Predicted Voxel
First Image Second Image third Image Fourth Image
First Image Second Image third Image Fourth Image
First Image Second Image third Image Fourth Image

Training command: python train_model.py --type vox --max_iter 60001 --batch_size 16 --num_workers 4 --save_freq 1000 --lr 2e-4 --output_path outputs/2_2_max_iter60000_b16_num_workers_4_lr_2e-4

Evaluation command (first get the models from Drive): python eval_model.py --type vox --load_checkpoint --n_points 1000 --eval_chk_file <your_checkpoint_path> --output_path outputs/2_1 --vis_freq 50

Note: This took the most time to train; around a day to get the best results.

2.2 Image to point cloud (15 points)

Input RGB Ground Truth Mesh Ground Truth Cloud Predicted Cloud
First Image Second Image third Image Fourth Image
First Image Second Image third Image Fourth Image
First Image Second Image third Image Fourth Image

Training command: python train_model.py --type point --n_points 1000 --max_iter 5001 --batch_size 16 --num_workers 4 --save_freq 1000 --lr 2e-4 --output_path outputs/2_2_max_iter5000_b16_num_workers_4_lr_2e-4

Evaluation command (first get the models from Drive): python eval_model.py --type point --load_checkpoint --n_points 1000 --eval_chk_file <your_checkpoint_path> --output_path outputs/2_2 --vis_freq 50

Note: This took the least time to train; around 3 hours to converge.

2.3 Image to mesh (15 points)

Input RGB Ground Truth Mesh Predicted Mesh
First Image Second Image Third Image
First Image Second Image Third Image
First Image Second Image Third Image

Training command: python train_model.py --type mesh --max_iter 25001 --batch_size 16 --num_workers 4 --save_freq 1000 --lr 2e-5 --output_path outputs/2_3_max_iter25000_b16_num_workers_4_lr_2e-5

Evaluation command (first get the models from Drive): python eval_model.py --type mesh --load_checkpoint --eval_chk_file <your_checkpoint_path> --output_path outputs/2_3 --vis_freq 50

2.4 Quantitative comparisons (10 points)

Voxel Evaluation Point Cloud Evaluation Mesh Evaluation
Evaluation vox Evaluation point Evaluation mesh

Analysis

Looking at the F1 scores across different thresholds, we can see clear performance differences between the three representations:

Point clouds perform best, achieving the highest F1 scores across all thresholds. This makes sense because point clouds have the most flexibility. They can place points anywhere in 3D space without being constrained by a grid or connectivity structure. This freedom lets them capture fine details and complex surfaces more accurately.

Voxels and meshes perform similarly, both showing comparable (and notably lower) F1 scores.

Despite their different approaches, both representations hit similar performance ceilings. Voxels are limited by grid resolution, while meshes are limited by the template topology; both constraints bottleneck performance. Notably, the voxel model took much longer to train to reach comparable results; had I trained the mesh model for longer, it might have outperformed the voxel model (though this remains only a hypothesis).
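For context, a rough sketch of how F1 at a distance threshold can be computed once every representation has been converted to a point set of shape B x N x 3 (the exact evaluation code may differ):

import torch
from pytorch3d.ops import knn_points

def f1_at_threshold(pred_pts, gt_pts, thresh=0.05):
    d_pred = knn_points(pred_pts, gt_pts, K=1).dists[..., 0].sqrt()  # pred -> GT distances
    d_gt = knn_points(gt_pts, pred_pts, K=1).dists[..., 0].sqrt()    # GT -> pred distances
    precision = 100.0 * (d_pred < thresh).float().mean()
    recall = 100.0 * (d_gt < thresh).float().mean()
    return 2 * precision * recall / (precision + recall + 1e-8)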

2.5 Analyse effects of hyperparameter variations (10 points)

Effect of n_points when training the point cloud model

n_points        500     1000    2000    5000
F1 curve        (F1 curve plots for each n_points setting)
f1@0.05         65.00   79.70   80.20   84.50

Analysis

The results clearly show that increasing the number of points has a positive impact on reconstruction quality. From 500 to 1000 points there is a substantial jump, after which the curve slowly plateaus: the structures need a certain minimum number of points to be represented accurately at an appropriate density.

Looking at the visualizations, the difference is quite noticeable. With just 500 points, the reconstructions look really sparse and it’s hard to make out the actual shape of the objects. As we move to 1000 and 2000 points, the models start to capture more detail and the overall structure becomes much clearer. At 5000 points, we get the densest and most detailed representations, where the object shapes are well-defined and recognizable.

The improvement from 500 to 1000 points is dramatic, but the gains taper off as we continue increasing the point count. These diminishing returns possibly arise because the model has already captured most of the important geometric features, or because of limits on the network's capacity to leverage the additional points effectively.

Effect of ico_sphere(6) vs ico_sphere(4) when training the mesh model

ico_sphere level    4       6
F1 curve            (F1 curve plots)
f1@0.05             69.22   73.75

Analysis

The results show that using ico_sphere level 6 produces better reconstructions than level 4. The higher subdivision level provides more vertices and faces, allowing the model to capture finer details and represent curved surfaces more smoothly, which is visible in the reconstructions where level 6 meshes look less blocky and more refined.

However, this improvement comes with increased computational cost since level 6 meshes have significantly more vertices to predict and optimize during training.
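To make the resolution difference concrete, the two template spheres can be compared directly with pytorch3d's ico_sphere helper (a small sketch):

from pytorch3d.utils import ico_sphere

for level in (4, 6):
    mesh = ico_sphere(level)
    n_verts = mesh.verts_packed().shape[0]
    n_faces = mesh.faces_packed().shape[0]
    print(f"ico_sphere({level}): {n_verts} vertices, {n_faces} faces")
# ico_sphere(4): 2562 vertices, 5120 faces
# ico_sphere(6): 40962 vertices, 81920 faces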

2.6 Interpret your model (15 points)

Let's visualise the saliency maps of our voxel model

Ground Truth Voxel Predicted Voxels Saliency Analysis

Analysis

The saliency maps were computed using Integrated Gradients, a gradient-based attribution method that measures the contribution of each input pixel to the model's prediction. The technique interpolates between a baseline image (a black image) and the actual input over 50 steps, computes gradients at each interpolation step, and averages them. The final saliency is obtained by multiplying these averaged gradients with the difference between the input and the baseline, taking the absolute value, and summing across colour channels. This is based on the work "Axiomatic Attribution for Deep Networks". (I did use ChatGPT to help implement this; I initially struggled with visualising the last conv layer's activations, possibly because the spatial correlation had broken down by that depth, so I resorted to more approximate methods such as this one.)
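A minimal sketch of the procedure described above (here model and image, a 1 x 3 x H x W tensor, are placeholders, and the prediction is summarised by summing the predicted occupancies):

import torch

def integrated_gradients_saliency(model, image, steps=50):
    baseline = torch.zeros_like(image)          # black baseline image
    grads = []
    for alpha in torch.linspace(0, 1, steps):
        x = (baseline + alpha * (image - baseline)).detach().requires_grad_(True)
        score = model(x).sum()                  # scalar summary of the voxel predictions
        score.backward()
        grads.append(x.grad.detach())
    avg_grad = torch.stack(grads).mean(dim=0)
    attribution = (image - baseline) * avg_grad
    return attribution.abs().sum(dim=1)         # 1 x H x W saliency map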

The saliency maps reveal which regions of the input image the model focuses on when making predictions. As shown in the visualizations, the model pays the most attention to the edges and contours of the objects, with brighter regions in the heatmap indicating higher importance. This makes sense because edges and silhouettes contain crucial geometric information needed to infer the 3D structure.

Let’s look at some Qualitative Results too

Input RGB Ground Truth Mesh Ground Truth Voxel Predicted Voxel
First Image Second Image third Image Fourth Image
First Image Second Image third Image Fourth Image
First Image Second Image third Image Fourth Image
First Image Second Image third Image Fourth Image

Analysis

The qualitative results reveal some limitations of the voxel model when dealing with complex chair geometries. The model struggles particularly with chairs that have unusual features like holes, very thin legs, or unconventional structures. In these challenging cases, the predictions tend to fall back to a more generic, simplified chair shape rather than accurately capturing the intricate details present in the ground truth.

This behavior suggests that the model has learned a strong prior for “typical” chair structures from the training data, which helps it make reasonable predictions for standard designs but limits its ability to generalize to edge cases. The voxel representation itself may also contribute to this issue—thin structures like narrow legs or armrests can be difficult to represent accurately in a discrete voxel grid, especially at lower resolutions, leading the model to either omit them entirely or thicken them into more voxel-friendly shapes.

python eval_model_deep.py --eval_chk_file karthik_checkpoints/voxel.pth --output_path outputs/2_6

3. (Extra Credit) Exploring some recent architectures.

3.1 Implicit network (10 points)

Input RGB Ground Truth Mesh Predicted Occupancy
First Image Second Image Third Image
First Image Second Image Third Image
First Image Second Image Third Image

F1 Curve

First Image

Analysis

Implementation details: I used a focal loss instead of the standard occupancy loss, and imbalanced sampling instead of balanced sampling (since chairs don't occupy 50% of the grid), to get good results. The model initially produced poor results, essentially outputting the prior, with an F1 score of around 20 at the 0.05 threshold. With these changes, the score jumps to 52.20.
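A minimal sketch of the focal loss variant mentioned above, assuming the occupancy head outputs raw logits for the sampled query points (the gamma and alpha values are illustrative, not necessarily what I used):

import torch
import torch.nn.functional as F

def focal_occupancy_loss(logits, targets, gamma=2.0, alpha=0.25):
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                            # probability assigned to the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()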

The implicit network approach shows decent results in generating smooth, continuous 3D representations from single images. Unlike voxel-based methods that discretize space into a fixed grid, the implicit network learns a continuous occupancy function that can be queried at any point in 3D space, resulting in smoother surfaces and more flexible representations.

However, the F1 score of 52.20 is notably lower than that of the explicit voxel predictions. Possible reasons: implicit networks are harder to train and require careful sampling strategies during both training and inference; the model may need more training time to converge properly; or the continuous representation might struggle with sharp edges. The visual results show that while the overall shape is captured reasonably well, there is room for improvement in accurately representing geometric details, particularly in complex regions like chair legs and armrests.
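As a sketch of how such a continuous occupancy function can still be evaluated like a voxel grid (occ_net, its feats argument, the grid resolution, and the threshold are all placeholders here, not the actual interface of my model):

import torch

def occupancy_to_voxels(occ_net, feats, res=32, thresh=0.5):
    axis = torch.linspace(-0.5, 0.5, res)
    grid = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), dim=-1)
    pts = grid.reshape(1, -1, 3)                     # 1 x res^3 x 3 query points
    with torch.no_grad():
        occ = torch.sigmoid(occ_net(feats, pts))     # occupancy probability per query point
    return (occ > thresh).reshape(1, res, res, res)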

python train_model_occnet.py --type 'occ' --max_iter 10001 --batch_size 16 --lr 2e-3 --num_workers 4 --save_freq 1000 --n_sample_pt 256 --output_path outputs/3_1_max_iter10000_b16_num_workers4_lr_2e-3

python eval_model_occ.py --type occ --load_checkpoint --eval_chk_file <your_checkpoint_path> --output_path outputs/3_1 --vis_freq 100

3.2 Parametric network (10 points)

Not Attempted

3.3 Extended dataset for training (10 points)

Results for the model trained on the entire dataset:

Input RGB Ground Truth Voxels Predicted Voxels
First Image Second Image Third Image
First Image Second Image Third Image
First Image Second Image Third Image

Voxel model trained on the whole dataset vs. model trained only on chairs, evaluated only on chairs

Let M1 be the model trained only on chairs, and M2 be the model trained on the whole dataset.

Voxel Model M1 M2
F1 curve
Input RGB Ground Truth Voxels Predicted Voxels (M1) Predicted Voxels (M2)

Analysis

Training on multiple object categories (M2) versus a single category (M1) reveals multiple trade-offs. When evaluated specifically on chairs, M1 (trained only on chairs) significantly outperforms M2 (trained on chairs, cars, and planes), as evidenced by the F1 curves. This makes sense because M1 can dedicate its entire capacity to learning the specific geometric patterns and variations within the chair class, leading to more accurate and detailed reconstructions.

Looking at the qualitative comparisons, M1's predictions show better structural accuracy and capture finer details like armrests and chair backs more faithfully. M2's outputs, while still recognizable as chairs, tend to be more generic and sometimes miss subtle features. This is because M2 must learn shared representations across three very different object categories, which forces the model to find more general features at the expense of category-specific details. However, the advantage of M2 is its versatility: it can handle diverse object types (chairs, cars, planes) with a single model, whereas M1 is specialized for chairs only. M2 does not perform especially well at reconstructing cars, but it performs well on airplanes and chairs. Voxel model training requires significant time, and the model continues to improve even after apparent convergence; with additional training time, M2 would likely achieve better performance across all three categories.

Note: I had to mount additional storage in AWS; the guide given by the instructors was perfect (thank you).