16-825 Learning for 3D Vision • Fall 2025
Name: Haejoon Lee (andrewid: haejoonl)

Assignment 2: Single View to 3D Reconstruction

1. Exploring Loss Functions

In this section, we implemented and tested three different loss functions for fitting 3D representations: voxel grids, point clouds, and meshes.

1.1 Fitting a Voxel Grid

Implemented binary cross-entropy loss to fit a 3D binary voxel grid. This loss function is ideal for voxel occupancy prediction as it treats each voxel as an independent binary classification problem.

# Voxel Loss Implementation
# Per-voxel binary cross-entropy, treating voxel_src as occupancy logits
# and voxel_tgt as binary occupancy targets
loss = torch.nn.functional.binary_cross_entropy_with_logits(voxel_src, voxel_tgt)
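
For context, here is a minimal sketch of the fitting loop this loss drives; the target grid, learning rate, and step count below are illustrative stand-ins, not the graded code:

import torch

# Stand-in binary target grid and learnable source logits
voxel_tgt = (torch.rand(1, 32, 32, 32) > 0.7).float()
voxel_src = torch.zeros(1, 32, 32, 32, requires_grad=True)

optimizer = torch.optim.Adam([voxel_src], lr=1e-1)
for step in range(500):
    optimizer.zero_grad()
    loss = torch.nn.functional.binary_cross_entropy_with_logits(voxel_src, voxel_tgt)
    loss.backward()
    optimizer.step()
# sigmoid(voxel_src) now approximates the target occupancy
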
Voxel fitting comparison

Voxel grid optimization: Source (red) converging to target (blue)

1.2 Fitting a Point Cloud

Implemented Chamfer loss from scratch to fit 3D point clouds. The Chamfer distance measures the average nearest-neighbor distance between two point sets, providing bidirectional correspondence.

# Chamfer Loss Implementation
from pytorch3d.ops import knn_points

# For each point in source, find its nearest neighbor in target
knn_src = knn_points(point_cloud_src, point_cloud_tgt, K=1)
loss_src = knn_src.dists[..., 0].mean()

# For each point in target, find its nearest neighbor in source
knn_tgt = knn_points(point_cloud_tgt, point_cloud_src, K=1)
loss_tgt = knn_tgt.dists[..., 0].mean()

chamfer_loss = loss_src + loss_tgt
Point cloud fitting comparison

Point cloud optimization: Source (red) converging to target (blue)

1.3 Fitting a Mesh

Implemented a smoothness loss to regularize mesh fitting. This Laplacian smoothing loss penalizes the deviation of each vertex from the centroid of its neighbors, encouraging smooth surfaces.

# Smoothness Loss Implementation (uniform Laplacian regularizer)
from pytorch3d.loss import mesh_laplacian_smoothing

loss_laplacian = mesh_laplacian_smoothing(mesh_src, method="uniform")
Mesh fitting comparison

Mesh optimization: Source (red) converging to target (blue) with smoothness regularization

2. Reconstructing 3D from Single View

2.1 Image to Voxel Grid (20 points)

Trained a neural network decoder to predict 32×32×32 binary voxel grids from single RGB images. The decoder uses transposed convolutions to upsample from image features to 3D occupancy predictions.

Network Architecture:
  • Encoder: ResNet18 pretrained backbone → 512D feature vector
  • Decoder: FC layer → reshape → 3D transposed convolutions
  • Output: 32×32×32 voxel grid with binary occupancy
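
Below is a minimal sketch of such a decoder. The channel counts and the 4³ seed volume are assumptions for illustration, not the exact graded architecture:

import torch
import torch.nn as nn

class VoxelDecoder(nn.Module):
    """Maps a 512-D image feature to a 32^3 grid of occupancy logits."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(512, 256 * 4 * 4 * 4)  # seed a coarse 4^3 volume
        self.deconv = nn.Sequential(
            nn.ConvTranspose3d(256, 128, kernel_size=4, stride=2, padding=1),  # 4 -> 8
            nn.ReLU(),
            nn.ConvTranspose3d(128, 64, kernel_size=4, stride=2, padding=1),   # 8 -> 16
            nn.ReLU(),
            nn.ConvTranspose3d(64, 1, kernel_size=4, stride=2, padding=1),     # 16 -> 32
        )

    def forward(self, feat):                       # feat: (B, 512)
        x = self.fc(feat).view(-1, 256, 4, 4, 4)   # (B, 256, 4, 4, 4)
        return self.deconv(x).squeeze(1)           # (B, 32, 32, 32) logits
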
Voxel input 1

Input RGB Image

Voxel result 1

Prediction (red) vs Ground Truth (blue)

Voxel input 2

Input RGB Image

Voxel result 2

Prediction (red) vs Ground Truth (blue)

Voxel input 3

Input RGB Image

Voxel result 3

Prediction (red) vs Ground Truth (blue)

Observations: The voxel model successfully captures the overall shape and structure of chairs. However, we notice that the predicted voxel grids tend to be sparser than the ground truth—the model predicts fewer occupied voxels overall. This can be explained by:
  • Class Imbalance During Training: In a 32³ voxel grid, the vast majority of voxels are empty. This severe imbalance causes the model to be conservative in predicting occupied voxels to minimize loss.
  • Discretization Effects: The 32³ resolution forces the model to choose between occupied/empty for each voxel, and when uncertain, it errs on the side of predicting empty to avoid false positives.
  • Training Objective: Binary cross-entropy loss penalizes false positives and false negatives equally, but in practice, predicting fewer occupied voxels reduces the overall error rate given the class imbalance.
The blocky appearance is inherent to the 32³ resolution limit; despite this, major features such as the backrest and seat are clearly reconstructed with the correct topology.

2.2 Image to Point Cloud (20 points)

Trained a decoder to directly predict 3D point coordinates (N×3) from image features. This representation provides higher resolution than voxels without the memory overhead.

Network Architecture:
  • Encoder: ResNet18 → 512D features
  • Decoder: FC layers with ReLU activations
  • Output: N points × 3 coordinates (trained with N=3000)
  • Loss: Chamfer distance between predicted and GT point sets
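
A minimal sketch of this decoder follows; the hidden-layer widths and the tanh squashing of output coordinates are illustrative assumptions:

import torch
import torch.nn as nn

class PointDecoder(nn.Module):
    """Maps a 512-D image feature to an (N, 3) set of point coordinates."""
    def __init__(self, n_points=3000):
        super().__init__()
        self.n_points = n_points
        self.mlp = nn.Sequential(
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, 2048), nn.ReLU(),
            nn.Linear(2048, n_points * 3),
        )

    def forward(self, feat):                                # feat: (B, 512)
        out = self.mlp(feat)                                # (B, N*3)
        return torch.tanh(out).view(-1, self.n_points, 3)   # points in [-1, 1]^3
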
Point cloud GT 1

Ground Truth

Point cloud result 1

Prediction (red) vs GT (blue)

Point cloud GT 2

Ground Truth

Point cloud result 2

Prediction (red) vs GT (blue)

Point cloud GT 3

Ground Truth

Point cloud result 3

Prediction (red) vs GT (blue)

Observations: Point cloud predictions closely match the overall shape of the ground truth. The model captures fine details like armrests and leg structures better than voxel representations.

Point Distribution Pattern: We observe that predicted points tend to concentrate in two main regions (typically the seat and backrest areas). This occurs because:
  • Training with Chamfer Loss: Chamfer distance is computed as average nearest-neighbor distances, which doesn't explicitly enforce uniform point distribution. The model learns to densely sample large, visible surfaces to minimize reconstruction error.
  • Feature Salience: The network learns that seat and backrest are the most visually and functionally important features of chairs, leading to higher point density in these regions during prediction.

2.3 Image to Mesh (20 points)

Trained a mesh deformation network that starts from an icosphere and learns to deform it into the target shape. This approach leverages mesh topology for smooth, continuous surfaces.

Network Architecture:
  • Encoder: ResNet18 → 512D features
  • Decoder: FC layers predicting per-vertex deformations
  • Initial mesh: Icosphere (subdivided 2-3 times)
  • Loss: Chamfer loss + Smoothness regularization
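
A minimal sketch of the deformation decoder, assuming a level-3 icosphere and illustrative layer widths, built on PyTorch3D's ico_sphere and offset_verts:

import torch
import torch.nn as nn
from pytorch3d.utils import ico_sphere

class MeshDecoder(nn.Module):
    """Predicts per-vertex offsets that deform a template icosphere."""
    def __init__(self, level=3):
        super().__init__()
        self.template = ico_sphere(level)              # shared source mesh
        n_verts = self.template.verts_packed().shape[0]
        self.mlp = nn.Sequential(
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, n_verts * 3),
        )

    def forward(self, feat):                           # feat: (B, 512)
        offsets = self.mlp(feat).view(-1, 3)           # packed (B * n_verts, 3)
        mesh = self.template.extend(feat.shape[0])     # one copy per batch item
        return mesh.offset_verts(offsets)              # deformed meshes
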
Mesh GT 1

Ground Truth

Mesh result 1

Prediction vs GT

Mesh GT 2

Ground Truth

Mesh result 2

Prediction vs GT

Mesh GT 3

Ground Truth

Mesh result 3

Prediction vs GT

Observations: The mesh model successfully captures the overall shape and structure of chairs. However, the predictions exhibit extremely spiky surfaces with many irregular protrusions. This spikiness can be explained by:
  • Insufficient Smoothness Regularization: The default smoothness weight (w_smooth) is too low, allowing the mesh to deform aggressively to minimize Chamfer loss without sufficient penalty for surface irregularities.
  • Icosphere Topology Constraints: Starting from a sphere requires large deformations to match chair geometry. The model struggles to stretch the spherical topology into thin structures (legs, armrests) while maintaining smooth surfaces.
  • Competing Loss Terms: Chamfer loss drives vertices toward the target shape, while smoothness loss tries to maintain surface quality. When smoothness weight is too low, Chamfer loss dominates, creating spikes as vertices aggressively move to minimize point-to-point distances.
  • Limited Vertex Budget: The icosphere has a fixed number of vertices. To capture fine details, individual vertices make extreme movements, creating spikes rather than smooth approximations.
As shown in Sections 2.5 and 2.6, increasing the smoothness weight (w_smooth) reduces spikiness somewhat, illustrating the trade-off between reconstruction accuracy and surface quality. The model also struggles with thin structures and sharp corners due to the spherical initialization topology.

2.4 Quantitative Comparisons (10 points)

We evaluated all three representations using the F1 score at varying distance thresholds. The F1 score is the harmonic mean of precision and recall, where precision is the fraction of predicted points lying within the threshold of some ground-truth point, and recall is the fraction of ground-truth points lying within the threshold of some predicted point.
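
Concretely, a sketch of how such an F1 score can be computed between sampled point clouds (names are illustrative; note that knn_points returns squared distances, hence the sqrt):

import torch
from pytorch3d.ops import knn_points

def f1_score(pred_points, gt_points, threshold=0.05):
    # Nearest-neighbor distances in both directions; knn_points returns
    # squared Euclidean distances, so take the square root before thresholding
    d_pred = knn_points(pred_points, gt_points, K=1).dists[..., 0].sqrt()
    d_gt = knn_points(gt_points, pred_points, K=1).dists[..., 0].sqrt()

    precision = (d_pred < threshold).float().mean()  # predicted points near GT
    recall = (d_gt < threshold).float().mean()       # GT points near prediction
    return 2 * precision * recall / (precision + recall + 1e-8)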

Voxel F1 scores

Voxel Grid F1 Scores

Point cloud F1 scores

Point Cloud F1 Scores

Mesh F1 scores

Mesh F1 Scores

Quantitative Analysis:

| Model Type | F1@0.05 | Strengths | Weaknesses |
|---|---|---|---|
| Point Cloud | ~70-80% | Highest F1 score, flexible representation, good detail capture | No surface information, discrete points |
| Mesh | ~40-50% | Smooth surfaces, continuous representation, rendering quality | Topology constraints, struggles with thin structures, spiky surfaces |
| Voxel Grid | ~70% | Explicit occupancy, easy to render, GPU-friendly | Limited resolution (32³), memory-intensive at higher resolutions |

Intuitive Explanation:

  • Point clouds perform best because they directly optimize what is being measured (point-to-point distances) without intermediate representations. The network has the most degrees of freedom to match the ground truth.
  • Meshes have medium performance due to: (1) the icosphere initialization constrains topology, (2) smoothness regularization prevents matching sharp features, (3) deformation-based approaches struggle with large shape variations from a sphere, and (4) spiky surfaces from insufficient smoothness regularization hurt reconstruction quality.
  • Voxel grids perform worst primarily because the model predicts very sparse occupancy (fewer occupied voxels than GT) due to class imbalance during training. The 32³ resolution also introduces discretization artifacts and quantization errors. When sampled to points for F1 evaluation, the sparse voxel predictions result in incomplete surface coverage, significantly hurting the F1 scores.

2.5 Hyperparameter Analysis (10 points)

We analyzed the effect of varying the smoothness weight (w_smooth) in mesh reconstruction, which controls the trade-off between fitting accuracy and surface smoothness.

Mesh w_smooth=20.0

w_smooth = 20.0 (Moderate)

Mesh w_smooth=200.0

w_smooth = 200.0 (High)

Voxel with 3000 points

Alternative experiment: Voxel evaluation with N=3000 sample points

Hyperparameter Impact Analysis:

Smoothness Weight (w_smooth) in Mesh Reconstruction:

  • Low values (2.0, the original result in Section 2.3): Mesh fits the data more tightly but may have surface irregularities and spikes. Better Chamfer loss but visual artifacts.
  • Medium values (20.0): Balanced trade-off between fit quality and surface smoothness. Generally optimal for most cases.
  • High values (200.0): Very smooth surfaces but may fail to capture fine details and sharp features.

Key Insight: There's a fundamental trade-off between reconstruction accuracy (low w_smooth) and visual quality (high w_smooth). The optimal value depends on the application: use lower values for accurate geometry measurement, higher values for visual rendering.
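
In code, this trade-off comes down to a weighted sum of the two terms. A sketch using PyTorch3D's loss helpers, with stand-in inputs and illustrative names:

import torch
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes
from pytorch3d.utils import ico_sphere

w_smooth = 20.0                            # the knob varied in this section
pred_mesh = ico_sphere(3)                  # stand-in predicted mesh
gt_points = torch.rand(1, 3000, 3) - 0.5   # stand-in target samples

pred_points = sample_points_from_meshes(pred_mesh, 3000)
loss_chamfer, _ = chamfer_distance(pred_points, gt_points)
loss_smooth = mesh_laplacian_smoothing(pred_mesh, method="uniform")

loss = loss_chamfer + w_smooth * loss_smooth   # total objective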

Additional Finding: Increasing the number of sample points (n_points) for voxel evaluation from 1000 to 3000 improves F1 scores by providing denser surface coverage, though the returns diminish and GPU memory limits further increases.

2.6 Model Interpretation (15 points)

To better understand what the mesh model learns during training, we implemented loss component visualization that tracks individual loss terms over time.
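
A sketch of how these curves can be produced, assuming the weighted terms are logged into plain Python lists each training step (the plotting helper below is illustrative):

import matplotlib.pyplot as plt

def plot_loss_components(chamfer_history, smooth_history, out_path="loss_components.png"):
    # Each history list holds the *weighted* term actually summed into the
    # total loss at every step, e.g. (w_smooth * loss_smooth).item()
    plt.figure()
    plt.plot(chamfer_history, label="weighted Chamfer loss")
    plt.plot(smooth_history, label="weighted smoothness loss")
    plt.xlabel("training step")
    plt.ylabel("loss value")
    plt.legend()
    plt.savefig(out_path)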

Loss components w_smooth=2.0

Loss Components: w_smooth=2.0

Loss components w_smooth=20.0

Loss Components: w_smooth=20.0

Loss components w_smooth=200.0

Loss Components: w_smooth=200.0

Interpretation Insights:

Loss Component Visualization: By plotting weighted Chamfer loss and weighted smoothness loss separately, we can observe:

  • Training Dynamics: Chamfer loss decreases rapidly in early training, then plateaus. Smoothness loss initially increases (mesh deforms away from smooth sphere), then stabilizes.
  • Weight Impact: Higher w_smooth values cause the smoothness term to dominate, limiting how much the mesh can deform to fit the data. This explains why high w_smooth leads to lower F1 scores but smoother visual appearance.

Why This Matters: This visualization reveals that the "spiky mesh" problem isn't a bug—it's the model correctly minimizing Chamfer loss without sufficient smoothness regularization. Understanding this trade-off helps us tune hyperparameters more intelligently.

3. Advanced Architectures & Datasets

3.1 Implicit Network (10 points)

Implemented an implicit occupancy network that takes 3D coordinates and image features as input and predicts occupancy values. This continuous representation can be queried at any resolution.

Architecture Design:
  • Input: Concatenation of 512D image features + 3D coordinates (x,y,z)
  • Network: 5-layer MLP with 256 hidden units and ReLU activations
  • Output: Single occupancy logit per 3D location
  • Training: Query 32³ grid in normalized space [-0.5, 0.5]³
  • Key Innovation: Continuous representation that can be evaluated at arbitrary resolutions
# Implicit Decoder Forward Pass
def forward(self, image_features, coordinates):
    # image_features: (B, 512)
    # coordinates: (B, N, 3) - 3D query locations
    B, N, _ = coordinates.shape

    # Expand features so every query point is paired with the image embedding
    expanded_features = image_features.unsqueeze(1).expand(B, N, -1)

    # Concatenate features with coordinates -> (B, N, 515)
    combined = torch.cat([expanded_features, coordinates], dim=-1)

    # Pass through the MLP to predict per-point occupancy
    occupancy_logits = self.network(combined)  # (B, N, 1)
    return occupancy_logits
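
For reference, a sketch of building the 32³ query grid over the normalized [-0.5, 0.5]³ space described above (the axis ordering here is an assumption; the actual dataloader may differ):

import torch

res = 32
axis = torch.linspace(-0.5, 0.5, res)
zz, yy, xx = torch.meshgrid(axis, axis, axis, indexing="ij")
coords = torch.stack([xx, yy, zz], dim=-1).reshape(1, -1, 3)  # (1, 32^3, 3)
# decoder(image_features, coords) then reshapes to a (1, 32, 32, 32) grid
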
Implicit input

Input RGB Image

Implicit result

Implicit Network Prediction vs GT

Results & Analysis:

Performance: Unfortunately, the implicit network performs poorly, producing mostly filled, slanted reconstructions that fail to capture the overall chair shape. The predictions appear as dense, blob-like structures rather than recognizable furniture.

Why This Happens:

  • Weak Feature Conditioning: Simply concatenating 512D image features with 3D coordinates may not provide sufficient spatial guidance. The network struggles to learn the complex mapping from image features to spatially-varying occupancy.
  • Training Difficulty: Implicit networks require careful hyperparameter tuning and often need more training steps than direct prediction methods. The model may have converged to a poor local minimum.
  • Class Imbalance Effects: Despite using pos_weight in BCE loss, the severe imbalance (>95% empty voxels) makes it difficult for the network to learn fine-grained occupancy patterns. The model tends toward over-prediction to avoid missing positive samples.
  • Limited Architecture: A simple 5-layer MLP may lack the representational capacity needed for this task. More sophisticated architectures with skip connections, positional encodings, or hierarchical features would likely perform better.

Technical Challenges Addressed:

  • Coordinate System Mismatch: Fixed coordinates from [-1,1]³ to [-0.5,0.5]³ to align with GT voxels.
  • Initialization: Added negative bias (-2.0) to final layer to encourage sparsity.
  • Class Imbalance: Applied pos_weight to BCE loss.
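
To make the class-imbalance fix concrete, a sketch of the weighted BCE setup; the pos_weight value is illustrative, roughly the empty-to-occupied ratio, and the tensors are stand-ins:

import torch

# Weight positive (occupied) voxels more heavily to counter the >95% empty majority
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor([20.0]))

logits = torch.randn(2, 32**3, 1)                    # stand-in predictions
targets = (torch.rand(2, 32**3, 1) > 0.95).float()   # ~5% occupied, like real grids
loss = criterion(logits, targets)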

Conclusion: While the implicit representation offers theoretical advantages (continuous querying, resolution-independence), achieving good reconstruction quality requires more sophisticated network architectures and training strategies than implemented here. This demonstrates that architectural choices significantly impact 3D reconstruction performance.

3.2 Extended Dataset Training (10 points)

Trained point cloud models on both single-class (chair only) and multi-class (chair, car, plane) datasets to analyze the impact of dataset diversity on reconstruction quality.

Experimental Setup:
  • Model: Point cloud decoder (N=3000 points)
  • Dataset 1: 6,780 chair samples (single class)
  • Dataset 2: ~20,000 samples (chair + car + plane)
  • Evaluation: F1 scores on chair test set for both models

Qualitative Results: Visual Comparison

Extended dataset input 1

Input Image 1

Extended dataset input 2

Input Image 2

Extended dataset input 3

Input Image 3

Sample 1 Comparisons:

Sample 0 chair vs GT

Chair-Only Model vs GT

Sample 0 full vs GT

Multi-Class Model vs GT

Sample 0 chair vs full

Chair-Only vs Multi-Class

Sample 2 Comparisons:

Sample 1 chair vs GT

Chair-Only Model vs GT

Sample 1 full vs GT

Multi-Class Model vs GT

Sample 1 chair vs full

Chair-Only vs Multi-Class

Sample 3 Comparisons:

Sample 2 chair vs GT

Chair-Only Model vs GT

Sample 2 full vs GT

Multi-Class Model vs GT

Sample 2 chair vs full

Chair-Only vs Multi-Class

Quantitative Results: F1 Score Comparison

F1 comparison chart

F1 Score comparison: Chair-only (blue) vs Multi-class (red) training

Analysis & Findings:

Quantitative Results:

| Threshold | Chair-Only F1 | Multi-Class F1 | Difference |
|---|---|---|---|
| 0.01 | 6.0% | 6.3% | +0.3% |
| 0.02 | 27.9% | 28.7% | +0.8% |
| 0.03 | 58.9% | 52.0% | -6.9% |
| 0.04 | 80.9% | 66.7% | -14.2% |
| 0.05 | 90.3% | 76.9% | -13.4% |

Key Observations:

  1. Specialization vs Generalization Trade-off: The chair-only model significantly outperforms the multi-class model on chair reconstruction (13.4 points higher F1@0.05). This demonstrates that specialization improves performance on the target class.
  2. Qualitative Differences:
    • Chair-Only: Captures fine details more accurately. Tighter fit to GT.
    • Multi-Class: Produces smoother, more generic shapes that work across categories but miss class-specific details.

Conclusion:

Training on a single class produces specialized, higher-quality reconstructions for that class at the cost of generalization. Training on multiple classes produces a more general model that performs moderately well across categories but sacrifices some accuracy on each individual class.

Model Capacity Hypothesis: The performance gap may also reflect insufficient model capacity. Our network architecture (ResNet18 encoder + simple FC decoder) has limited representational capacity. When trained on a single class, all capacity focuses on learning chair-specific features. When trained on three diverse classes (chairs, cars, planes), the same fixed capacity must be divided among learning features for all categories, leading to a capacity bottleneck where no single class is learned as well. A larger model with more parameters might close this performance gap while maintaining multi-class versatility.

The choice depends on the application: use specialized models for best single-category performance, or use multi-class models (ideally with larger capacity) for versatility across categories.