16-825 Learning for 3D Vision
Rohan Nagabhirava
Optimized Voxel Grid
Ground Truth Voxel Grid
Loss Function: Binary Cross Entropy (BCE) Loss
We use a binary cross-entropy loss, which compares the predicted occupancy probability of each voxel against the ground-truth occupancy and penalizes the mismatch between the two distributions, averaged over every voxel in the grid.
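A minimal sketch of this loss, assuming the decoder outputs raw logits over a 32³ grid (the logits-vs-probabilities choice is an assumption, not stated above):

```python
import torch
import torch.nn.functional as F

def voxel_loss(voxel_pred, voxel_gt):
    # voxel_pred: (B, 32, 32, 32) raw occupancy logits from the decoder
    #             (assumption: logits rather than probabilities).
    # voxel_gt:   (B, 32, 32, 32) binary occupancy in {0, 1}.
    # BCE-with-logits compares predicted occupancy against the ground truth,
    # averaged over all voxels in the grid.
    return F.binary_cross_entropy_with_logits(voxel_pred, voxel_gt.float())
```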
Optimized Point Cloud
Ground Truth Point Cloud
Loss Function: Chamfer Distance
We implemented the Chamfer distance, which, for each predicted point, takes the minimum L2 distance to the ground-truth set, and for each ground-truth point the minimum L2 distance to the predicted set, and sums the two terms. In other words, the loss measures how far every point is from its nearest neighbor in the other point set, in both directions.
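A brute-force sketch of this loss using torch.cdist; one could equally compute the nearest neighbors with pytorch3d.ops.knn_points, but the idea is the same:

```python
import torch

def chamfer_loss(point_cloud_src, point_cloud_tgt):
    # point_cloud_src: (B, N, 3) predicted points
    # point_cloud_tgt: (B, M, 3) ground-truth points
    # Pairwise squared L2 distances between every predicted and ground-truth point.
    dists = torch.cdist(point_cloud_src, point_cloud_tgt) ** 2  # (B, N, M)
    # For each predicted point, distance to its nearest ground-truth point ...
    loss_src = dists.min(dim=2).values.mean()
    # ... and for each ground-truth point, distance to its nearest predicted point.
    loss_tgt = dists.min(dim=1).values.mean()
    return loss_src + loss_tgt
```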
Optimized Mesh
Ground Truth Mesh
Loss Function: Chamfer Distance + Smoothness Loss
We keep the chamfer_loss on the mesh vertices and add a smoothness_loss based on Laplacian smoothing. This term penalizes a vertex for being far from the average of its connected, adjacent vertices, so neighboring vertices are discouraged from differing sharply and the surface stays smooth.
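A sketch of the combined mesh loss; it uses PyTorch3D's built-in chamfer_distance and mesh_laplacian_smoothing for brevity (our own chamfer_loss from above could be substituted), and the smoothness weight shown is illustrative, not the value we tuned:

```python
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing

def mesh_loss(mesh_pred, points_gt, w_smooth=0.1):
    # Chamfer term between the predicted mesh vertices and the ground-truth points.
    # mesh_pred is a pytorch3d Meshes object; points_gt is (B, M, 3).
    loss_chamfer, _ = chamfer_distance(mesh_pred.verts_padded(), points_gt)
    # Laplacian smoothing: penalizes each vertex's offset from the average of
    # its connected neighbors, discouraging jagged local geometry.
    loss_smooth = mesh_laplacian_smoothing(mesh_pred, method="uniform")
    # w_smooth is an illustrative weight.
    return loss_chamfer + w_smooth * loss_smooth
```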
Input RGB
Predicted Voxel Grid
Ground Truth Voxel Grid
Input RGB
Predicted Voxel Grid
Ground Truth Voxel Grid
Input RGB
Predicted Voxel Grid
Ground Truth Voxel Grid
Decoder Architecture: The voxel decoder is a sequential network. It starts with a linear layer that expands the 512-dimensional latent vector to 4096, which is reshaped into a coarse 3D feature volume. This is followed by four 3D transposed convolution layers, each with a 3D batch norm and a ReLU after it.
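A sketch of this decoder. The channel widths and the 64 × 4³ reshape are assumptions for illustration; only the 512 → 4096 linear layer and the four ConvTranspose3d + BatchNorm3d + ReLU blocks are fixed by the description above:

```python
import torch
import torch.nn as nn

class VoxelDecoder(nn.Module):
    def __init__(self, latent_dim=512):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 4096)
        self.deconv = nn.Sequential(
            nn.ConvTranspose3d(64, 32, kernel_size=4, stride=2, padding=1),  # 4^3 -> 8^3
            nn.BatchNorm3d(32), nn.ReLU(),
            nn.ConvTranspose3d(32, 16, kernel_size=4, stride=2, padding=1),  # 8^3 -> 16^3
            nn.BatchNorm3d(16), nn.ReLU(),
            nn.ConvTranspose3d(16, 8, kernel_size=4, stride=2, padding=1),   # 16^3 -> 32^3
            nn.BatchNorm3d(8), nn.ReLU(),
            nn.ConvTranspose3d(8, 1, kernel_size=3, stride=1, padding=1),    # 32^3, 1 channel
            # The final BatchNorm/ReLU follows the written description; in
            # practice this last activation is often dropped so the output can
            # be treated as logits for the BCE loss.
            nn.BatchNorm3d(1), nn.ReLU(),
        )

    def forward(self, z):
        x = self.fc(z).view(-1, 64, 4, 4, 4)      # reshape to a 3D feature volume
        return self.deconv(x).squeeze(1)           # (B, 32, 32, 32) occupancy scores
```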
Input RGB
Predicted Point Cloud
Ground Truth Point Cloud
Input RGB
Predicted Point Cloud
Ground Truth Point Cloud
Input RGB
Predicted Point Cloud
Ground Truth Point Cloud
Decoder Architecture: For the point cloud decoder, we use an MLP with four linear layers. Each hidden layer doubles the output dimension of the previous one (512 to 1024 to 2048 to 4096), and the last layer maps to the total number of points multiplied by 3 for their xyz coordinates. There is a ReLU activation between consecutive layers.
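A minimal sketch of this MLP, assuming n_points (set to 5000 here purely for illustration) matches the number of points used during training:

```python
import torch
import torch.nn as nn

class PointCloudDecoder(nn.Module):
    def __init__(self, latent_dim=512, n_points=5000):
        super().__init__()
        self.n_points = n_points
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 2048), nn.ReLU(),
            nn.Linear(2048, 4096), nn.ReLU(),
            nn.Linear(4096, n_points * 3),   # no activation after the last layer
        )

    def forward(self, z):
        # Reshape the flat output into (B, N, 3) xyz coordinates.
        return self.mlp(z).view(-1, self.n_points, 3)
```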
Input RGB
Predicted Mesh
Ground Truth Mesh
Input RGB
Predicted Mesh
Ground Truth Mesh
Input RGB
Predicted Mesh
Ground Truth Mesh
Decoder Architecture: For the mesh decoder, we have four linear layers. The first increases the dimension from the 512-dimensional latent space to 1024. The middle two linear layers keep a constant 1024-dimensional space. The last linear layer outputs the number of vertices multiplied by 3, the xyz coordinates for all of them. There is a ReLU activation between consecutive layers, and the final output passes through a Tanh activation.
Mesh Initialization: We initialize the mesh using an icosphere with subdivision level 4.
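A sketch of the mesh decoder together with the icosphere initialization. How the per-vertex outputs are combined with the template is not fixed by the description above; treating them as offsets applied via Meshes.offset_verts is an assumption here:

```python
import torch
import torch.nn as nn
from pytorch3d.utils import ico_sphere

# Initial mesh: an icosphere at subdivision level 4, as stated above.
src_mesh = ico_sphere(level=4)
n_verts = src_mesh.verts_packed().shape[0]

class MeshDecoder(nn.Module):
    def __init__(self, latent_dim=512, n_verts=n_verts):
        super().__init__()
        self.n_verts = n_verts
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, n_verts * 3),
            nn.Tanh(),   # final Tanh keeps per-vertex outputs bounded
        )

    def forward(self, z):
        # (B, V, 3) per-vertex predictions; in this sketch they are assumed to
        # be added to the initial icosphere vertices as offsets.
        return self.mlp(z).view(-1, self.n_verts, 3)
```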
Voxel Grid F1 Scores
Point Cloud F1 Scores
Mesh F1 Scores
Across the three representations, the F1 curves follow a similar trend: as the threshold increases, all three achieve noticeably higher F1 scores, which is expected since a larger threshold tolerates a greater distance between predicted and ground-truth points.
The point cloud achieves the highest F1 scores overall. Looking at the graphs, it reaches almost 80% F1 at the largest threshold of 0.05, whereas voxel and mesh both top out around 75%. At the 0.03 threshold, the point cloud is highest at 53%, with mesh at 50% and voxel at 47%.
Why these differences? The voxel grid is a high-dimensional representation (a 32³ grid) that is harder to learn, resulting in lower F1 scores throughout the curve. Point clouds are the easiest to learn because the network only needs to predict xyz coordinates for each point. Mesh falls in between: while it also predicts vertex coordinates like a point cloud, it additionally has to respect the topology and structure of the faces, making it more constrained and slightly harder to optimize than free-form point clouds.
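For reference, a sketch of how the F1 score at a given distance threshold can be computed from sampled surface points (the exact sampling and threshold sweep used for the plots follow the course evaluation script; this is an illustrative re-implementation):

```python
import torch

def f1_score(points_pred, points_gt, threshold):
    # points_pred: (N, 3) predicted surface samples; points_gt: (M, 3) ground truth.
    dists = torch.cdist(points_pred, points_gt)                        # (N, M) L2 distances
    precision = (dists.min(dim=1).values < threshold).float().mean()   # pred -> gt
    recall = (dists.min(dim=0).values < threshold).float().mean()      # gt -> pred
    return 2 * precision * recall / (precision + recall + 1e-8)

# Sweeping the threshold (e.g. 0.01 to 0.05) produces the F1 curves shown above.
```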
We varied the initial mesh topology to analyze how different starting geometries affect reconstruction quality.
Baseline (ico_sphere level 4)
Variation 1 (torus initialization)
Variation 2 (ico_sphere level 3)
Input RGB
Baseline
Torus Init
Coarser Sphere
In this experiment, we explored hyperparameter variations of the initial mesh for mesh-based 3D inference. Our baseline was an icosphere of subdivision level 4. We compared two variations: (1) a torus-based initialization, and (2) a coarser sphere (icosphere of level 3).
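The three initializations can be constructed directly with PyTorch3D utilities; the torus radii and resolution shown here are illustrative, not the exact values we used:

```python
from pytorch3d.utils import ico_sphere, torus

# Baseline: icosphere at subdivision level 4.
mesh_baseline = ico_sphere(level=4)

# Variation 1: torus initialization (radii and ring/side counts are assumptions).
mesh_torus = torus(r=0.5, R=1.0, sides=32, rings=32)

# Variation 2: coarser icosphere at subdivision level 3 (fewer vertices and faces).
mesh_coarse = ico_sphere(level=3)
```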
Quantitative Results: Looking at the F1 scores, all three variations reach about the same values overall, and the curves are quite similar. However, at the 0.03 threshold, the coarser sphere performs slightly worse than both the torus and the baseline icosphere. This is because it has fewer faces than the torus initialization and the baseline icosphere, limiting its representational capacity.
Torus Initialization: The torus initialization is particularly interesting because it has a different topology than a sphere: it contains a hole in the center. This means it can potentially learn some shapes better, especially those with a similar topology. However, for genus-zero shapes with no holes (like most chairs), it may struggle to learn the correct topology and can introduce artifacts.
Qualitative Comparison: Looking at the same chair across the different initializations, we can see clear differences. The baseline captures the bottom legs well, showing the separation between each leg clearly. The torus gets fairly close - the legs and the overall chair shape are visible - but it was not able to recover fine details. The coarser sphere gives a noticeably worse result than the baseline sphere, which makes sense because there are fewer vertices and faces to deform, resulting in a coarser reconstruction.
Conclusion: The initial mesh is a crucial design choice. Topology matters: starting from a mesh whose topology matches the shapes being reconstructed helps considerably. While different initializations can converge to similar quantitative scores, the qualitative results reveal that topology and mesh resolution significantly impact the fine-grained geometric details of the reconstruction.
We analyze the learned weights of the point cloud and mesh decoders to understand how different 3D representations affect the network's learned parameters. By comparing weight distributions and layer-wise statistics, we can identify differences in learning strategies between the two decoder architectures.
Weight Distribution Histograms
Layer-wise Statistics
Similar Weight Distributions: The weight distributions of the two decoders are very similar. This makes sense: the architectures are alike and they solve closely related problems, so similar transformations are learned for predicting point cloud and mesh geometry. Both decoders converged to nearly identical Gaussian-like distributions centered around zero across all layers, indicating they learned comparable feature mappings.
Over-parameterization: Many weight values sit at or near zero, and sparsity is high throughout both networks. We may have over-parameterized these decoders and could have used a much simpler network to learn similar features. The high sparsity and concentration of weights near zero suggest that many parameters contribute minimally to the output, indicating the networks are likely larger than necessary for this task.
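A sketch of the kind of per-layer summary behind these plots; the near-zero tolerance used to measure sparsity is an illustrative choice:

```python
import torch

def layer_weight_stats(model, zero_tol=1e-3):
    # Collect per-layer weight statistics for the histogram / sparsity analysis.
    stats = {}
    for name, param in model.named_parameters():
        if "weight" not in name:
            continue
        w = param.detach().flatten()
        stats[name] = {
            "mean": w.mean().item(),
            "std": w.std().item(),
            # "Sparsity" here means the fraction of weights close to zero;
            # zero_tol is an arbitrary cutoff chosen for illustration.
            "near_zero_frac": (w.abs() < zero_tol).float().mean().item(),
        }
    return stats
```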
Model Type: Mesh-based reconstruction
Dataset: Extended dataset with 3 classes (chair, car, plane) using split_3c.json
Comparison: Single-class training (chair only) vs Multi-class training (chair, car, plane)
Single Class Training (Chair only)
Three Class Training (Chair, Car, Plane)
Comparing single-class specialist vs multi-class generalist on chair reconstruction
Input RGB
Single-Class Training
Multi-Class Training
Single-class model cannot generalize to planes
Input RGB
Single-Class Training
Multi-Class Training
Single-class model cannot generalize to cars
Input RGB
Single-Class Training
Multi-Class Training
F1 Score Comparison: Based on the F1 graphs, the three-class model clearly reaches a much higher F1 score. At a threshold of 0.05, the three-class model reaches approximately 85% F1, versus only around 75% for the single-class model. Notably, the multi-class model was trained for only 25,000 iterations compared to 100,000 for the single-class model, yet still achieved significantly better performance.
Qualitative Comparison - Chair: Looking at the chair comparison (the only object type the single-class model was trained on), the multi-class model's reconstruction shows noticeably more noise and is not as crisp and clean as the single-class model's. This demonstrates that specializing on a single class produces cleaner results for that specific object type.
Generalization Test - Plane: When evaluating the single-class model on the full dataset with a plane as input, it does much worse than the multi-class model, which produces a much cleaner result. This is expected: the multi-class model can generalize to the object types it has been trained on, while a model trained only on chairs generalizes poorly to objects it has never seen.
Generalization Test - Car: Looking at the car example, both the single-class and multi-class models struggle with this object. It appears to be a difficult shape to reconstruct, so it is hard to say which performed better; both produce quite poor outputs.
Conclusion: The plane example shows clearly that the multi-class model generalizes much better across classes, while the single-class model performs better on its specialized class. Multi-class training improves generalization but hurts specialization: there is a fundamental trade-off between training a specialist model that excels at one object type and a generalist model that handles diverse object categories with reasonable performance across all of them.