16-825 Assignment 2: Single View to 3D

Andrew ID: kpullala

1. Exploring Loss Functions

1.1 Fitting a Voxel Grid (5 points)

Initial voxel grid
Optimized voxel grid
Ground-truth (target) voxel grid
Loss curve
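
For reference, a minimal sketch of the fitting setup, assuming the voxel grid is optimized as raw logits under a binary cross-entropy objective (names are illustrative, not the exact assignment scaffold):

```python
import torch
import torch.nn.functional as F

def fit_voxel_grid(target_voxels, n_iters=1000, lr=1e-2):
    # target_voxels: (D, H, W) float tensor of {0, 1} occupancies.
    # Optimize raw logits so that sigmoid(logits) matches the target.
    logits = torch.randn_like(target_voxels, requires_grad=True)
    optimizer = torch.optim.Adam([logits], lr=lr)
    for _ in range(n_iters):
        optimizer.zero_grad()
        loss = F.binary_cross_entropy_with_logits(logits, target_voxels)
        loss.backward()
        optimizer.step()
    return torch.sigmoid(logits)  # predicted occupancy probabilities
```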

1.2 Fitting a Point Cloud (5 points)

Source (initial) point cloud
Optimized point cloud
Ground-truth (target) point cloud
Loss curve
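
The point-cloud fit minimizes the symmetric Chamfer distance between source and target points; a self-contained sketch (my actual code follows the assignment's knn-based implementation, but the objective is equivalent):

```python
import torch

def chamfer_loss(src, tgt):
    # src: (N, 3), tgt: (M, 3) point clouds.
    d2 = torch.cdist(src, tgt) ** 2  # (N, M) squared pairwise distances
    # Nearest-neighbor distances in both directions, averaged.
    return d2.min(dim=1).values.mean() + d2.min(dim=0).values.mean()

def fit_point_cloud(target, n_points=1000, n_iters=1000, lr=1e-2):
    src = torch.randn(n_points, 3, requires_grad=True)  # random initial cloud
    optimizer = torch.optim.Adam([src], lr=lr)
    for _ in range(n_iters):
        optimizer.zero_grad()
        loss = chamfer_loss(src, target)
        loss.backward()
        optimizer.step()
    return src.detach()
```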

1.3 Fitting a Mesh (5 points)

Source (initial) mesh
Optimized mesh
Ground-truth (target) mesh
Loss curve
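
The mesh fit deforms the vertices of a source mesh under a Chamfer term plus a smoothness regularizer; a sketch using PyTorch3D, assuming an ico-sphere source and Laplacian smoothing (the exact source mesh and weights in my run may differ):

```python
import torch
from pytorch3d.utils import ico_sphere
from pytorch3d.ops import sample_points_from_meshes
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing

def fit_mesh(target_mesh, n_iters=1000, lr=1e-2, w_smooth=0.1):
    src_mesh = ico_sphere(level=4)
    # Optimize a per-vertex offset rather than the vertices themselves.
    deform = torch.zeros(src_mesh.verts_packed().shape, requires_grad=True)
    optimizer = torch.optim.Adam([deform], lr=lr)
    for _ in range(n_iters):
        optimizer.zero_grad()
        new_mesh = src_mesh.offset_verts(deform)
        src_pts = sample_points_from_meshes(new_mesh, 5000)
        tgt_pts = sample_points_from_meshes(target_mesh, 5000)
        loss_cd, _ = chamfer_distance(src_pts, tgt_pts)
        loss = loss_cd + w_smooth * mesh_laplacian_smoothing(new_mesh)
        loss.backward()
        optimizer.step()
    return src_mesh.offset_verts(deform.detach())
```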

2. Reconstructing 3D from Single View

2.1 Image to Voxel Grid (20 points)

Example 1: input RGB image, predicted voxel grid, ground-truth mesh
Example 2: input RGB image, predicted voxel grid, ground-truth mesh
Example 3: input RGB image, predicted voxel grid, ground-truth mesh

2.2 Image to Point Cloud (20 points)

Example 1: input RGB image, predicted point cloud, ground-truth mesh
Example 2: input RGB image, predicted point cloud, ground-truth mesh
Example 3: input RGB image, predicted point cloud, ground-truth mesh

2.3 Image to Mesh (20 points)

Example 1: input RGB image, predicted mesh, ground-truth mesh
Example 2: input RGB image, predicted mesh, ground-truth mesh
Example 3: input RGB image, predicted mesh, ground-truth mesh

2.4 Quantitative Comparisons (10 points)

F1 score comparison across different 3D representations.

F1-score curve: voxel grid
F1-score curve: point cloud
F1-score curve: mesh

Analysis and Intuition

The F1-score comparison shows that the model achieved nearly identical performance (approaching an F1 score of 80) regardless of whether the output was represented as a voxel grid, point cloud, or mesh. In all three plots the F1 score climbs smoothly toward a similar plateau as the distance threshold is relaxed, suggesting the model has learned the overall geometry well and that the shape complexity does not critically stress the expressive power of any one format.
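
For context, the F1 score at a threshold t counts a predicted point as correct if it lies within t of some ground-truth point (precision), and vice versa for recall. A minimal sketch of the metric as commonly defined (my numbers come from the course-provided evaluation, reported on a 0-100 scale):

```python
import torch

def f1_score(pred_pts, gt_pts, threshold=0.05):
    # pred_pts: (N, 3), gt_pts: (M, 3) sampled surface points.
    d = torch.cdist(pred_pts, gt_pts)  # (N, M) pairwise distances
    precision = (d.min(dim=1).values < threshold).float().mean()
    recall = (d.min(dim=0).values < threshold).float().mean()
    return 100 * 2 * precision * recall / (precision + recall + 1e-8)
```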

Efficiency and Scalability
While the final metric is similar, the point cloud representation offers distinct practical advantages. It is the most computationally efficient and scalable: its memory and time costs grow linearly with the number of points, O(N). This contrasts sharply with the voxel representation, whose dense grid grows cubically with resolution, O(N³), making high-resolution reconstructions infeasible due to extreme memory demands.
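
A quick back-of-the-envelope comparison (assuming float32 storage) illustrates the cubic blow-up:

```python
# Memory of a dense float32 occupancy grid at increasing resolutions.
for res in (32, 64, 128, 256):
    print(f"{res}^3 voxels: {res ** 3 * 4 / 2 ** 20:6.1f} MB")
# -> 0.1 MB, 1.0 MB, 8.0 MB, 64.0 MB: 8x memory per doubling of resolution.
# By contrast, a 10,000-point cloud needs 10_000 * 3 * 4 bytes, about 0.11 MB.
```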

Point clouds also excel at capturing detail due to their adaptive, non-grid-bound nature, and they train effectively with differentiable losses such as the Chamfer distance. The mesh representation, while offering explicit topology, requires a more complex training pipeline and additional loss terms.
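
For reference, the symmetric Chamfer distance between a predicted point set \( \hat{S} \) and the ground truth \( S \) is

$$
d_{\mathrm{CD}}(\hat{S}, S) = \frac{1}{|\hat{S}|} \sum_{x \in \hat{S}} \min_{y \in S} \lVert x - y \rVert_2^2 + \frac{1}{|S|} \sum_{y \in S} \min_{x \in \hat{S}} \lVert x - y \rVert_2^2
$$

Each point is matched to its nearest neighbor in the other set, so the loss is differentiable with respect to the point positions almost everywhere.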

2.5 Hyperparameter Analysis (10 points)

Effects of varying hyperparameters on model performance.

Hyperparameter Studied: Pointcloud Density

Methodology: I varied the point-cloud density by adjusting the number of points predicted by the model, starting at 1,000 points and later increasing to 2,000.

Input RGB image (pointcloud density: 1000)
Input RGB image (pointcloud density: 2000)
F1-score comparison chart (density: 1000)
F1-score comparison chart (density: 2000)

Conclusions

With a density of 2,000 points, the F1 score increased by 1.5% over the 1,000-point configuration. My analysis suggests that higher point-cloud density allows better representation of complex geometries, leading to improved model performance. However, I could not train the denser configuration for the same number of steps due to the increased computational requirements.

Hyperparameter Studied: Pointcloud Density & Chamfer Distance weight

Methodology: I varied the point-cloud density by adjusting the number of points as above, and also experimented with the loss weights for the Chamfer distance term: I set the Chamfer weight to 0.8 and increased the smoothness factor to 0.4.
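
Concretely, the combined objective is a weighted sum of the two terms; a sketch with the weights tested here (variable names are illustrative, not my exact code):

```python
# Weighted objective: Chamfer fit term plus smoothness regularizer.
w_chamfer, w_smooth = 0.8, 0.4

def total_loss(loss_chamfer, loss_smooth):
    return w_chamfer * loss_chamfer + w_smooth * loss_smooth
```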

F1-score comparison chart (density: 2000)
F1-score comparison chart (density: 2000, Chamfer weight 0.8, smoothness weight 0.4)

Conclusions

As seen previously, increasing the number of points slightly improved the F1 score, indicating better model performance with higher point-cloud density. However, changing the Chamfer weight and smoothness factor produced little improvement, which suggests that small changes to the Chamfer weight do not significantly impact overall performance. I plan to investigate further by varying these parameters more drastically.

2.6 Model Interpretation (15 points)

Voxel model analysis

Monte Carlo Dropout visualization at 1,000 steps
Monte Carlo Dropout visualization at 60,000 steps

Summary of Monte Carlo Dropout Visualization

This visualization uses Monte Carlo Dropout to assess prediction uncertainty in a single-view 3D reconstruction task by comparing the model's state at 1,000 steps (under-trained) and 60,000 steps (well-trained).
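
A minimal sketch of how the mean/variance maps can be produced, assuming the voxel decoder contains dropout layers (function and variable names are illustrative):

```python
import torch
import torch.nn as nn

def enable_dropout(model):
    # Keep only the dropout layers stochastic at inference time;
    # batch norm and everything else stays in eval mode.
    for m in model.modules():
        if isinstance(m, (nn.Dropout, nn.Dropout2d, nn.Dropout3d)):
            m.train()

def mc_dropout_voxels(model, image, n_samples=20):
    model.eval()
    enable_dropout(model)
    with torch.no_grad():
        samples = torch.stack(
            [torch.sigmoid(model(image)) for _ in range(n_samples)]
        )
    # Mean occupancy is the prediction; variance is per-voxel uncertainty.
    return samples.mean(dim=0), samples.var(dim=0)
```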


Key Observations by Training Stage:

1,000 Steps (Under-Trained, Left):

  • Mean Prediction: Sharp but unreliable structure with high contrast.
  • Uncertainty (Variance): Extremely high and uniform across all planes (bright yellow/orange everywhere). The model is essentially guessing randomly as it hasn't learned meaningful 3D patterns yet.

60,000 Steps (Well-Trained, Right):

  • Mean Prediction: Clear, well-defined L-shape structure with proper voxel occupancy.
  • Uncertainty (Variance): Low overall with strategic localization. High confidence in object interior (dark regions), with uncertainty concentrated only at boundaries and ambiguous regions (bright spots).

Primary Insight: From Uniform to Calibrated Uncertainty

The evolution from 1,000 to 60,000 steps demonstrates proper model calibration:

  • Early training: Uniform high uncertainty everywhere indicates the model hasn't learned what features matter for 3D reconstruction.
  • After training: Uncertainty becomes structured and localized, concentrated at:
    • Object boundaries where voxel occupancy transitions occur
    • Depth-ambiguous regions (particularly visible in XZ plane)

3. Exploring Other Architectures / Datasets

3.1 Implicit Network (10 points)

Decoder that predicts occupancy values from 3D coordinates and image features.

Implementation Details
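
A sketch of one common design for such a decoder, assuming the image encoding is concatenated with each query point's coordinates and passed through an MLP with a sigmoid occupancy head (layer sizes are illustrative, not my exact configuration):

```python
import torch
import torch.nn as nn

class ImplicitDecoder(nn.Module):
    # Maps (image feature, 3D query point) -> occupancy probability in [0, 1].
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feats, points):
        # feats: (B, feat_dim) image encoding; points: (B, P, 3) query coords.
        feats = feats.unsqueeze(1).expand(-1, points.shape[1], -1)
        x = torch.cat([feats, points], dim=-1)
        return torch.sigmoid(self.mlp(x)).squeeze(-1)  # (B, P) occupancies
```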

Input RGB image
Predicted occupancy
Ground truth

3.2 Parametric Network (10 points)

Implementation Details

Architecture:
1. Multi-patch MLP system with multiple independent decoders (e.g., 10 patches)
   • Each patch: 2D latent (u, v) → 4-layer MLP (512 units) → 3D coordinates (x, y, z)
   • Batch normalization and ReLU activations throughout

2. Trained on an airplane mesh from the ShapeNet dataset

Sampling Strategy:
  • Random (u, v) coordinates sampled from the uniform distribution on [0, 1]
  • Points divided equally across all patch decoders
  • Training: Chamfer loss against target point cloud samples
  • Inference: generate points by passing random 2D coordinates through the trained decoders
  • Preprocessing: point clouds centered and normalized to the unit sphere

This approach allows multiple surface patches to collectively represent complex 3D shapes like aircraft with wings and a fuselage; see the sketch below.
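
A condensed sketch consistent with the description above (the full training code is in `bonus_3_2.py`):

```python
import torch
import torch.nn as nn

class PatchDecoder(nn.Module):
    # One surface patch: 2D latent (u, v) -> 3D point via a 4-layer MLP.
    def __init__(self, hidden=512):
        super().__init__()
        layers, in_dim = [], 2
        for _ in range(3):
            layers += [nn.Linear(in_dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU()]
            in_dim = hidden
        layers.append(nn.Linear(hidden, 3))
        self.net = nn.Sequential(*layers)

    def forward(self, uv):   # uv: (P, 2) sampled uniformly from [0, 1]^2
        return self.net(uv)  # (P, 3) points on this patch

class MultiPatchNet(nn.Module):
    def __init__(self, n_patches=10):
        super().__init__()
        self.patches = nn.ModuleList(PatchDecoder() for _ in range(n_patches))

    def forward(self, n_points):
        # Split the point budget equally across the patch decoders,
        # then train the concatenated output with a Chamfer loss.
        per_patch = n_points // len(self.patches)
        uv = torch.rand(len(self.patches), per_patch, 2)
        return torch.cat([p(uv[i]) for i, p in enumerate(self.patches)], dim=0)
```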

Code can be found in `bonus_3_2.py`.

Ground-truth object
Reconstructed 3D points

3.3 Extended Dataset Training (10 points)

Representation Chosen: Voxel

Training Setup          F1 Score (Chair)
Single Class (Chair)    76.2%
Three Classes           63.0%
Single-class results (trained on one class)
Multi-class results (trained on three classes)
Loss comparison curves
Example 1: input image, predicted voxel grid, ground-truth voxel grid
Example 2: input image, predicted voxel grid, ground-truth voxel grid

Analysis: Single Class vs Multi-Class Training

Quantitative Observations: The F1 score decreased for chairs but was very high for airplanes (about 86.0%). My understanding is that when more classes are added while training for a similar number of steps, the model struggles to fit the data due to the increased complexity. Since chairs are harder to learn, the model clearly struggles to learn their representation in the same number of steps.

Loss Curve Observations: The loss curve for the multi-class model shows a more erratic pattern than that of the single-class model, indicating that the multi-class model has difficulty converging. The single-class loss also converges much more quickly.

Qualitative Observations: The drop in F1 score is also reflected in the visualizations. The single-class model produces a more focused and accurate representation of chairs, while the multi-class model struggles with ambiguity and misclassification.

Conclusions: Training on multiple, diverse object classes requires more training data, and potentially higher-capacity models, to achieve performance comparable to a model trained on a single class.