Name: Ishita Gupta
Andrew ID: ishitag


Q1 Exploring Loss Functions

1.1. Fitting a Voxel Grid

[Figure: Optimized Voxel Grid | Ground Truth Voxel Grid]

Observation: The optimized voxel grid converges to match the ground truth shape, demonstrating that BCE loss is effective for supervising binary voxel occupancy.
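A minimal sketch of this fitting setup in PyTorch (the random target below is a stand-in for the real ground-truth grid, and the logits-based BCE is one common choice):

```python
import torch

def voxel_loss(voxels_src, voxels_tgt):
    # voxels_src: raw (pre-sigmoid) logits; voxels_tgt: binary occupancies in {0, 1}
    return torch.nn.functional.binary_cross_entropy_with_logits(voxels_src, voxels_tgt)

# Stand-in ground truth; in the assignment this comes from the dataset.
voxels_tgt = torch.rand(1, 32, 32, 32).round()
# Randomly initialized grid of logits, optimized directly.
voxels_src = torch.randn(1, 32, 32, 32, requires_grad=True)

optimizer = torch.optim.Adam([voxels_src], lr=1e-2)
for step in range(2000):
    optimizer.zero_grad()
    loss = voxel_loss(voxels_src, voxels_tgt)
    loss.backward()
    optimizer.step()
```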


1.2. Fitting a Point Cloud

[Figure: Optimized Point Cloud | Ground Truth Point Cloud]

Observation: Starting from random Gaussian noise, the chamfer loss successfully guides the points to form the target chair shape, showing the effectiveness of bidirectional nearest-neighbor distance minimization.
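A minimal sketch of the bidirectional loss and fitting loop, assuming (B, N, 3) point clouds; this is a brute-force torch.cdist version rather than an optimized kNN implementation:

```python
import torch

def chamfer_loss(points_src, points_tgt):
    # Pairwise distances between every source and target point: (B, N_src, N_tgt)
    dists = torch.cdist(points_src, points_tgt)
    # Each source point to its nearest target, and vice versa (bidirectional)
    src_to_tgt = dists.min(dim=2).values.pow(2).mean()
    tgt_to_src = dists.min(dim=1).values.pow(2).mean()
    return src_to_tgt + tgt_to_src

# Start from random Gaussian noise and optimize the point positions directly.
points_tgt = torch.rand(1, 1000, 3)  # stand-in for points sampled from the target chair
points_src = torch.randn(1, 1000, 3, requires_grad=True)
optimizer = torch.optim.Adam([points_src], lr=1e-2)
for step in range(2000):
    optimizer.zero_grad()
    loss = chamfer_loss(points_src, points_tgt)
    loss.backward()
    optimizer.step()
```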


1.3. Fitting a Mesh

[Figure: Optimized Mesh | Ground Truth Mesh]

Observation: The icosphere successfully deforms to approximate the target chair shape. The smoothness regularization prevents the mesh from developing unrealistic geometric artifacts while maintaining surface quality.
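A minimal sketch of the fitting loop using PyTorch3D, assuming the target chair is available as an OBJ file; the file path, sample counts, and smoothness weight w_smooth are illustrative:

```python
import torch
from pytorch3d.io import load_objs_as_meshes
from pytorch3d.utils import ico_sphere
from pytorch3d.ops import sample_points_from_meshes
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing

mesh_tgt = load_objs_as_meshes(["chair.obj"])  # hypothetical path to the target mesh
mesh_src = ico_sphere(level=4)

# Optimize per-vertex offsets of the icosphere rather than the vertices themselves.
deform = torch.zeros(mesh_src.verts_packed().shape, requires_grad=True)
optimizer = torch.optim.Adam([deform], lr=1e-2)
w_smooth = 0.1  # illustrative regularization weight

for step in range(2000):
    optimizer.zero_grad()
    new_mesh = mesh_src.offset_verts(deform)
    pts_src = sample_points_from_meshes(new_mesh, num_samples=5000)
    pts_tgt = sample_points_from_meshes(mesh_tgt, num_samples=5000)
    loss_chamfer, _ = chamfer_distance(pts_src, pts_tgt)
    loss_smooth = mesh_laplacian_smoothing(new_mesh, method="uniform")
    loss = loss_chamfer + w_smooth * loss_smooth
    loss.backward()
    optimizer.step()
```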


Q2 Reconstructing 3D from Single View

2.1. Image to Voxel Grid

[Figure: Input RGB | Ground Truth Mesh | Predicted Voxel]

2.2. Image to Point Cloud (20 points)

[Figure: Input RGB | Ground Truth Mesh | Predicted Point Cloud]

2.3. Image to Mesh (20 points)

[Figure: Input RGB | Ground Truth Mesh | Predicted Mesh]

2.4. Quantitative Comparisons (10 points)

Individual F1 Score Curves:

[Figure: F1 score curves — Voxel Grid | Point Cloud | Mesh]

Quantitatively, the three representations show distinct performance characteristics: point clouds achieve the highest final F1 score, meshes fall in the middle, and voxel grids score lowest.

Explanation: This ranking is expected given the flexibility and constraints of each representation:

  1. Point Clouds (Highest F1): Point clouds are the least constrained representation. They only need to position individual points near the target surface with no requirements for connectivity or topology. This flexibility allows the network to achieve the best geometric matching, especially when evaluated using chamfer distance and precision/recall metrics.
  2. Meshes (Medium F1): Meshes are more constrained because they must maintain surface connectivity and valid topology. Additionally, starting from an icosphere template requires the network to perform complex geometric deformations to approximate chair-like shapes, which is challenging. The smoothness regularization also limits how much the mesh can deform to exactly match the target.
  3. Voxel Grids (Lowest F1): Voxels are the most constrained representation due to their fixed resolution (32x32x32 = 32,768 voxels). This discretization fundamentally limits the level of detail that can be captured compared to continuous representations. Fine features smaller than the voxel size cannot be represented, creating an inherent ceiling on reconstruction quality.

Conclusion: The trade-off between representation flexibility and structure explains the performance differences. Point clouds prioritize flexibility, meshes balance flexibility with surface coherence, and voxels sacrifice flexibility for structured regularity.
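For reference, the F1 metric behind these curves can be sketched as below, assuming both shapes are first reduced to point sets (points are sampled from the predicted mesh/voxel surfaces); this helper is illustrative, not the exact evaluation code:

```python
import torch

def f1_score(points_pred, points_gt, threshold=0.05):
    # points_pred: (N, 3), points_gt: (M, 3); threshold is the distance cutoff
    dists = torch.cdist(points_pred, points_gt)                       # (N, M)
    precision = (dists.min(dim=1).values < threshold).float().mean()  # predicted points near GT
    recall = (dists.min(dim=0).values < threshold).float().mean()     # GT points that are covered
    return 2 * precision * recall / (precision + recall + 1e-8)
```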


2.5. Analyze Effects of Hyperparameter Variations (10 points)

Hyperparameter Studied: Number of points (n_points) in point cloud reconstruction

I analyzed how varying the number of predicted points affects reconstruction quality by training three models with n_points = 500, 1000, and 2000.

[Figure: Input Image | Ground Truth | n_points = 500 | n_points = 1000 | n_points = 2000]

Analysis: The results show that increasing n_points from 500 to 1000 significantly improves the F1 score. This improvement occurs because more points provide better surface coverage and finer geometric sampling of the target shape.

However, further increasing from 1000 to 2000 points yields diminishing returns, indicating performance saturation. This suggests that 1000 points already provides sufficient resolution to capture the geometric details present in this dataset at the given image resolution (137x137).

Conclusion: There is a sweet spot around n_points = 1000 that balances reconstruction quality, training speed, and memory efficiency. Beyond this point, the additional computational cost outweighs the marginal performance gains.
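To illustrate why n_points trades quality against cost, here is a hypothetical point decoder head; the hidden size is an assumption, but the final linear layer (and thus parameter count, activation memory, and chamfer-loss cost) scales linearly with n_points:

```python
import torch.nn as nn

class PointDecoder(nn.Module):
    def __init__(self, n_points, feat_dim=512):
        super().__init__()
        self.n_points = n_points
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, n_points * 3),  # 3 coordinates per predicted point
        )

    def forward(self, feats):  # feats: (B, feat_dim) image features
        return self.mlp(feats).reshape(-1, self.n_points, 3)
```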


2.6. Interpret Your Model (15 points)

To understand how confident the model is in its predictions, I visualize the voxel grid at different occupancy thresholds. Each voxel contains a predicted probability (post-sigmoid value), and by varying the iso-value threshold during marching cubes extraction, I can see which regions the model is certain about versus uncertain.

[Figure: Input RGB | Threshold 0.2 | Threshold 0.3 (Low) | Threshold 0.4 | Threshold 0.5 (Medium) | Threshold 0.7 (High)]

This visualization reveals where the model is confident versus uncertain in its predictions. At high thresholds (0.7), only the core structures like the seat and backrest remain, indicating these are the regions where the model has strong evidence. As we lower the threshold, thin structures like chair legs and edges appear, but they're thicker and blobbier, showing the model is uncertain about their exact geometry. This demonstrates that the model's struggles with thin structures stem from low confidence rather than complete failure to detect them. The 32x32x32 voxel resolution creates inherent uncertainty for features near the voxel size limit, which manifests as lower probability values. The threshold choice thus represents a trade-off between precision (high threshold) and recall (low threshold).
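A minimal sketch of generating this threshold sweep, assuming the predicted grid is a NumPy array of post-sigmoid probabilities and using PyMCubes (any marching-cubes implementation would do):

```python
import numpy as np
import mcubes

# Stand-in for the model's post-sigmoid occupancy probabilities.
voxels = np.random.rand(32, 32, 32)

# Extract a mesh at each iso-value; higher thresholds keep only confident regions.
for isovalue in [0.2, 0.3, 0.4, 0.5, 0.7]:
    vertices, triangles = mcubes.marching_cubes(voxels, isovalue)
    mcubes.export_obj(vertices, triangles, f"pred_thresh_{isovalue}.obj")
```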


Q3 Exploring Other Architectures / Datasets (and Extra Credit)

3.1 Implicit Network (10 points)

I implemented an implicit decoder network inspired by Occupancy Networks that learns a continuous function f(image, x, y, z) -> occupancy, mapping 3D coordinates to occupancy values conditioned on image features. Unlike the standard voxel decoder, which directly predicts all 32x32x32 values at once, the implicit decoder queries each spatial coordinate independently.

The architecture is a 4-layer MLP ([512 features + 3 coords] -> 512 -> 256 -> 128 -> 1) that takes concatenated image features and 3D coordinates as input and outputs a single occupancy value. During the forward pass, I create a 32x32x32 meshgrid in normalized space [-1,1]^3, expand the image features to all voxel positions, concatenate them with the coordinates to form [B, 32768, 515] tensors, and query the implicit function at each point. For memory efficiency, I process coordinates in chunks of 4,096 points rather than all 32,768 simultaneously, reducing peak GPU memory usage by 8x while producing identical results. The network is trained with binary cross-entropy loss comparing the predicted occupancy field to the ground truth voxel grid.
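A condensed sketch of this decoder, matching the layer sizes and chunking described above (variable names are illustrative):

```python
import torch
import torch.nn as nn

class ImplicitDecoder(nn.Module):
    # f(image_features, x, y, z) -> occupancy logit, queried per coordinate.
    def __init__(self, feat_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, feats, res=32, chunk=4096):
        B = feats.shape[0]
        # Regular grid of query coordinates in [-1, 1]^3 -> (res^3, 3)
        axis = torch.linspace(-1, 1, res, device=feats.device)
        grid = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), dim=-1)
        coords = grid.reshape(-1, 3)
        out = []
        # Query in chunks of `chunk` points to bound peak GPU memory.
        for i in range(0, coords.shape[0], chunk):
            c = coords[i:i + chunk].unsqueeze(0).expand(B, -1, -1)  # (B, chunk, 3)
            f = feats.unsqueeze(1).expand(-1, c.shape[1], -1)       # (B, chunk, 512)
            out.append(self.mlp(torch.cat([f, c], dim=-1)))         # (B, chunk, 1)
        return torch.cat(out, dim=1).reshape(B, res, res, res)      # occupancy logits
```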

Results:

[Figure: samples 100, 200, 400 — Sample Input | Predicted Occupancy Field | Ground Truth]

The implicit network successfully learns spatially-aware predictions through coordinate conditioning, producing occupancy fields comparable to the voxel decoder while demonstrating the continuous function representation concept from the Occupancy Networks paper.


3.2 Parametric Network (10 points)


Implementation

I implemented a parametric decoder network inspired by AtlasNet that learns a continuous function f(image, u, v) -> (x, y, z), mapping 2D parametric coordinates to 3D points.


Key Concept: Unlike standard point decoders that directly predict 3D coordinates, the parametric decoder learns a continuous 2D-to-3D surface mapping. A regular UV grid (50x50 = 2,500 points) in [-1,1]^2 is created, and each UV coordinate is concatenated with image features and passed through the MLP to predict its corresponding 3D location.

How Point Clouds Are Generated

  1. Encode image through ResNet18 -> 512-dim features
  2. Create UV grid of 50x50 regular coordinates in [-1, 1]^2
  3. Concatenate each UV coordinate with image features: [512 + 2]
  4. Decode through MLP to get 3D point for each UV
  5. Assemble 2,500 structured 3D points into point cloud

The parametric formulation provides a continuous function that maintains surface topology through the UV structure.
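A minimal sketch of this decoder following steps 2 through 5 above; the MLP hidden sizes and the Tanh output range are assumptions, since only the input and output dimensions are specified here:

```python
import torch
import torch.nn as nn

class ParametricDecoder(nn.Module):
    # f(image_features, u, v) -> (x, y, z): maps a regular UV grid to a surface.
    def __init__(self, feat_dim=512, grid_size=50):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 3), nn.Tanh(),  # assumed sizes; outputs in [-1, 1]^3
        )
        # Fixed 50x50 UV grid in [-1, 1]^2 -> (2500, 2)
        axis = torch.linspace(-1, 1, grid_size)
        uu, vv = torch.meshgrid(axis, axis, indexing="ij")
        self.register_buffer("uv", torch.stack([uu, vv], dim=-1).reshape(-1, 2))

    def forward(self, feats):                         # feats: (B, 512)
        B, N = feats.shape[0], self.uv.shape[0]
        uv = self.uv.unsqueeze(0).expand(B, -1, -1)   # (B, 2500, 2)
        f = feats.unsqueeze(1).expand(-1, N, -1)      # (B, 2500, 512)
        return self.mlp(torch.cat([f, uv], dim=-1))   # (B, 2500, 3) point cloud
```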

Results

Below are test set examples showing input images, predicted parametric point clouds, and ground truth:

[Figure: Input | Predicted | Ground Truth]

Observations: The parametric network demonstrates effective learning of continuous 2D-to-3D surface mappings, combining the flexibility of point clouds with structured surface parameterization.

3.3 Extended Dataset for Training (10 points)

I trained the point cloud decoder on two datasets: (1) a single-class dataset with only chairs, and (2) an extended dataset with three classes (airplanes, cars, and chairs). Quantitatively, training on three classes achieves an F1@0.05 of 76.8% on chair test samples, compared to 79.9% when training on chairs alone, a 3.1 percentage-point decrease.

Qualitatively, the three-class model produces more diverse but occasionally less detailed reconstructions. The single-class model overfits to chair-specific features (legs, backrests, arm structures), while the three-class model learns more generalizable shape features applicable across object categories. This manifests as slightly reduced precision for fine chair details but improved robustness to unusual chair designs that deviate from the training distribution. The trade-off is a classic bias-variance balance: single-class training provides higher accuracy for the target category through specialization, while multi-class training offers better generalization and diversity at the cost of category-specific detail fidelity.

[Figure: Input RGB Image | Ground Truth Mesh | Predicted (Single Class) | Predicted (3 Classes)]