Name: Ishita Gupta
Andrew ID: ishitag


Q1 Exploring Loss Functions

1.1. Fitting a Voxel Grid

[Figure: Optimized Voxel Grid | Ground Truth Voxel Grid]

Observation: The optimized voxel grid converges to match the ground truth shape, demonstrating that BCE loss is effective for supervising binary voxel occupancy.
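A minimal sketch of this fitting setup in PyTorch (the random target below is a stand-in for the real ground-truth grid, and the logits-based BCE is one common choice):

```python
import torch

def voxel_loss(voxels_src, voxels_tgt):
    # voxels_src: raw (pre-sigmoid) logits; voxels_tgt: binary occupancies in {0, 1}
    return torch.nn.functional.binary_cross_entropy_with_logits(voxels_src, voxels_tgt)

# Stand-in ground truth; in the assignment this comes from the dataset.
voxels_tgt = torch.rand(1, 32, 32, 32).round()
# Randomly initialized grid of logits, optimized directly.
voxels_src = torch.randn(1, 32, 32, 32, requires_grad=True)

optimizer = torch.optim.Adam([voxels_src], lr=1e-2)
for step in range(2000):
    optimizer.zero_grad()
    loss = voxel_loss(voxels_src, voxels_tgt)
    loss.backward()
    optimizer.step()
```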


1.2. Fitting a Point Cloud

[Figure: Optimized Point Cloud | Ground Truth Point Cloud]

Observation: Starting from random Gaussian noise, the chamfer loss successfully guides the points to form the target chair shape, showing the effectiveness of bidirectional nearest-neighbor distance minimization.
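A minimal sketch of the bidirectional loss and fitting loop, assuming (B, N, 3) point clouds; this is a brute-force torch.cdist version rather than an optimized kNN implementation:

```python
import torch

def chamfer_loss(points_src, points_tgt):
    # Pairwise distances between every source and target point: (B, N_src, N_tgt)
    dists = torch.cdist(points_src, points_tgt)
    # Each source point to its nearest target, and vice versa (bidirectional)
    src_to_tgt = dists.min(dim=2).values.pow(2).mean()
    tgt_to_src = dists.min(dim=1).values.pow(2).mean()
    return src_to_tgt + tgt_to_src

# Start from random Gaussian noise and optimize the point positions directly.
points_tgt = torch.rand(1, 1000, 3)  # stand-in for points sampled from the target chair
points_src = torch.randn(1, 1000, 3, requires_grad=True)
optimizer = torch.optim.Adam([points_src], lr=1e-2)
for step in range(2000):
    optimizer.zero_grad()
    loss = chamfer_loss(points_src, points_tgt)
    loss.backward()
    optimizer.step()
```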


1.3. Fitting a Mesh

[Figure: Optimized Mesh | Ground Truth Mesh]

Observation: The icosphere successfully deforms to approximate the target chair shape. The smoothness regularization prevents the mesh from developing unrealistic geometric artifacts while maintaining surface quality.
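A minimal sketch of the fitting loop using PyTorch3D, assuming the target chair is available as an OBJ file; the file path, sample counts, and smoothness weight w_smooth are illustrative:

```python
import torch
from pytorch3d.io import load_objs_as_meshes
from pytorch3d.utils import ico_sphere
from pytorch3d.ops import sample_points_from_meshes
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing

mesh_tgt = load_objs_as_meshes(["chair.obj"])  # hypothetical path to the target mesh
mesh_src = ico_sphere(level=4)

# Optimize per-vertex offsets of the icosphere rather than the vertices themselves.
deform = torch.zeros(mesh_src.verts_packed().shape, requires_grad=True)
optimizer = torch.optim.Adam([deform], lr=1e-2)
w_smooth = 0.1  # illustrative regularization weight

for step in range(2000):
    optimizer.zero_grad()
    new_mesh = mesh_src.offset_verts(deform)
    pts_src = sample_points_from_meshes(new_mesh, num_samples=5000)
    pts_tgt = sample_points_from_meshes(mesh_tgt, num_samples=5000)
    loss_chamfer, _ = chamfer_distance(pts_src, pts_tgt)
    loss_smooth = mesh_laplacian_smoothing(new_mesh, method="uniform")
    loss = loss_chamfer + w_smooth * loss_smooth
    loss.backward()
    optimizer.step()
```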


Q2 Reconstructing 3D from Single View

2.1. Image to Voxel Grid

[Figure: Input RGB | Ground Truth Mesh | Predicted Voxel]

2.2. Image to Point Cloud (20 points)

[Figure: Input RGB | Ground Truth Mesh | Predicted Point Cloud]

2.3. Image to Mesh (20 points)

[Figure: Input RGB | Ground Truth Mesh | Predicted Mesh]

2.4. Quantitative Comparisons (10 points)

Individual F1 Score Curves:

[Figure: F1 score curves — Voxel Grid | Point Cloud | Mesh]

Quantitatively, the three representations show distinct performance characteristics: point clouds achieve the highest final F1 score, meshes fall in the middle, and voxel grids score lowest.

Explanation: This ranking is expected given the flexibility and constraints of each representation:

  1. Point Clouds (Highest F1): Point clouds are the least constrained representation. They only need to position individual points near the target surface with no requirements for connectivity or topology. This flexibility allows the network to achieve the best geometric matching, especially when evaluated using chamfer distance and precision/recall metrics.
  2. Meshes (Medium F1): Meshes are more constrained because they must maintain surface connectivity and valid topology. Additionally, starting from an icosphere template requires the network to perform complex geometric deformations to approximate chair-like shapes, which is challenging. The smoothness regularization also limits how much the mesh can deform to exactly match the target.
  3. Voxel Grids (Lowest F1): Voxels are the most constrained representation due to their fixed resolution (32x32x32 = 32,768 voxels). This discretization fundamentally limits the level of detail that can be captured compared to continuous representations. Fine features smaller than the voxel size cannot be represented, creating an inherent ceiling on reconstruction quality.

Conclusion: The trade-off between representation flexibility and structure explains the performance differences. Point clouds prioritize flexibility, meshes balance flexibility with surface coherence, and voxels sacrifice flexibility for structured regularity.
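For reference, the F1 metric behind these curves can be sketched as below, assuming both shapes are first reduced to point sets (points are sampled from the predicted mesh/voxel surfaces); this helper is illustrative, not the exact evaluation code:

```python
import torch

def f1_score(points_pred, points_gt, threshold=0.05):
    # points_pred: (N, 3), points_gt: (M, 3); threshold is the distance cutoff
    dists = torch.cdist(points_pred, points_gt)                       # (N, M)
    precision = (dists.min(dim=1).values < threshold).float().mean()  # predicted points near GT
    recall = (dists.min(dim=0).values < threshold).float().mean()     # GT points that are covered
    return 2 * precision * recall / (precision + recall + 1e-8)
```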


2.5. Analyze Effects of Hyperparameter Variations (10 points)

Hyperparameter Studied: Number of points (n_points) in point cloud reconstruction

I analyzed how varying the number of predicted points affects reconstruction quality by training three models with n_points = 500, 1000, and 2000.

[Figure: Input Image | Ground Truth | n_points = 500 | n_points = 1000 | n_points = 2000]

Analysis: The results show that increasing n_points from 500 to 1000 significantly improves the F1 score. This improvement occurs because more points provide better surface coverage and finer geometric sampling of the target shape.

However, further increasing from 1000 to 2000 points yields diminishing returns, indicating performance saturation. This suggests that 1000 points already provides sufficient resolution to capture the geometric details present in this dataset at the given image resolution (137x137).

Conclusion: There is a sweet spot around n_points = 1000 that balances reconstruction quality, training speed, and memory efficiency. Beyond this point, the additional computational cost outweighs the marginal performance gains.
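To illustrate why n_points trades quality against cost, here is a hypothetical point decoder head; the hidden size is an assumption, but the final linear layer (and thus parameter count, activation memory, and chamfer-loss cost) scales linearly with n_points:

```python
import torch.nn as nn

class PointDecoder(nn.Module):
    def __init__(self, n_points, feat_dim=512):
        super().__init__()
        self.n_points = n_points
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, n_points * 3),  # 3 coordinates per predicted point
        )

    def forward(self, feats):  # feats: (B, feat_dim) image features
        return self.mlp(feats).reshape(-1, self.n_points, 3)
```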


2.6. Interpret Your Model (15 points)

To understand how confident the model is in its predictions, I visualize the voxel grid at different occupancy thresholds. Each voxel contains a predicted probability (post-sigmoid value), and by varying the iso-value threshold during marching cubes extraction, I can see which regions the model is certain about versus uncertain.

[Figure: Input RGB | Threshold 0.2 | Threshold 0.3 (Low) | Threshold 0.4 | Threshold 0.5 (Medium) | Threshold 0.7 (High)]

This visualization reveals where the model is confident versus uncertain in its predictions. At high thresholds (0.7), only the core structures like the seat and backrest remain, indicating these are the regions where the model has strong evidence. As we lower the threshold, thin structures like chair legs and edges appear, but they're thicker and blobbier, showing the model is uncertain about their exact geometry. This demonstrates that the model's struggles with thin structures stem from low confidence rather than complete failure to detect them. The 32x32x32 voxel resolution creates inherent uncertainty for features near the voxel size limit, which manifests as lower probability values. The threshold choice thus represents a trade-off between precision (high threshold) and recall (low threshold).
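A minimal sketch of generating this threshold sweep, assuming the predicted grid is a NumPy array of post-sigmoid probabilities and using PyMCubes (any marching-cubes implementation would do):

```python
import numpy as np
import mcubes

# Stand-in for the model's post-sigmoid occupancy probabilities.
voxels = np.random.rand(32, 32, 32)

# Extract a mesh at each iso-value; higher thresholds keep only confident regions.
for isovalue in [0.2, 0.3, 0.4, 0.5, 0.7]:
    vertices, triangles = mcubes.marching_cubes(voxels, isovalue)
    mcubes.export_obj(vertices, triangles, f"pred_thresh_{isovalue}.obj")
```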


Q3 Exploring Other Architectures / Datasets (and Extra Credit)

3.1 Implicit Network (10 points)

I implemented an implicit decoder network inspired by Occupancy Networks that learns a continuous function f(image, x, y, z) -> occupancy, mapping 3D coordinates to occupancy values conditioned on image features. Unlike the standard voxel decoder, which directly predicts all 32x32x32 values at once, the implicit decoder queries each spatial coordinate independently.

The architecture is a 4-layer MLP ([512 features + 3 coords] -> 512 -> 256 -> 128 -> 1) that takes concatenated image features and 3D coordinates as input and outputs a single occupancy value. During the forward pass, I create a 32x32x32 meshgrid in normalized space [-1,1]^3, expand the image features to all voxel positions, concatenate them with the coordinates to form [B, 32768, 515] tensors, and query the implicit function at each point. For memory efficiency, I process coordinates in chunks of 4,096 points rather than all 32,768 simultaneously, reducing peak GPU memory usage by 8x while producing identical results. The network is trained with binary cross-entropy loss comparing the predicted occupancy field to the ground truth voxel grid.
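A condensed sketch of this decoder, matching the layer sizes and chunking described above (variable names are illustrative):

```python
import torch
import torch.nn as nn

class ImplicitDecoder(nn.Module):
    # f(image_features, x, y, z) -> occupancy logit, queried per coordinate.
    def __init__(self, feat_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, feats, res=32, chunk=4096):
        B = feats.shape[0]
        # Regular grid of query coordinates in [-1, 1]^3 -> (res^3, 3)
        axis = torch.linspace(-1, 1, res, device=feats.device)
        grid = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), dim=-1)
        coords = grid.reshape(-1, 3)
        out = []
        # Query in chunks of `chunk` points to bound peak GPU memory.
        for i in range(0, coords.shape[0], chunk):
            c = coords[i:i + chunk].unsqueeze(0).expand(B, -1, -1)  # (B, chunk, 3)
            f = feats.unsqueeze(1).expand(-1, c.shape[1], -1)       # (B, chunk, 512)
            out.append(self.mlp(torch.cat([f, c], dim=-1)))         # (B, chunk, 1)
        return torch.cat(out, dim=1).reshape(B, res, res, res)      # occupancy logits
```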

Results:

[Figure: samples 100, 200, 400 — Sample Input | Predicted Occupancy Field | Ground Truth]

The implicit network successfully learns spatially-aware predictions through coordinate conditioning, producing occupancy fields comparable to the voxel decoder while demonstrating the continuous function representation concept from the Occupancy Networks paper.


3.2 Parametric Network (10 points)


Implementation

I implemented a parametric decoder network inspired by AtlasNet that learns a continuous function f(image, u, v) -> (x, y, z), mapping 2D parametric coordinates to 3D points.


Key Concept: Unlike standard point decoders that directly predict 3D coordinates, the parametric decoder learns a continuous 2D-to-3D surface mapping. A regular UV grid (50x50 = 2,500 points) in [-1,1]^2 is created, and each UV coordinate is concatenated with image features and passed through the MLP to predict its corresponding 3D location.

How Point Clouds Are Generated

  1. Encode image through ResNet18 -> 512-dim features
  2. Create UV grid of 50x50 regular coordinates in [-1, 1]^2
  3. Concatenate each UV coordinate with image features: [512 + 2]
  4. Decode through MLP to get 3D point for each UV
  5. Assemble 2,500 structured 3D points into point cloud

The parametric formulation provides a continuous function that maintains surface topology through the UV structure.
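A minimal sketch of this decoder following steps 2 through 5 above; the MLP hidden sizes and the Tanh output range are assumptions, since only the input and output dimensions are specified here:

```python
import torch
import torch.nn as nn

class ParametricDecoder(nn.Module):
    # f(image_features, u, v) -> (x, y, z): maps a regular UV grid to a surface.
    def __init__(self, feat_dim=512, grid_size=50):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 3), nn.Tanh(),  # assumed sizes; outputs in [-1, 1]^3
        )
        # Fixed 50x50 UV grid in [-1, 1]^2 -> (2500, 2)
        axis = torch.linspace(-1, 1, grid_size)
        uu, vv = torch.meshgrid(axis, axis, indexing="ij")
        self.register_buffer("uv", torch.stack([uu, vv], dim=-1).reshape(-1, 2))

    def forward(self, feats):                         # feats: (B, 512)
        B, N = feats.shape[0], self.uv.shape[0]
        uv = self.uv.unsqueeze(0).expand(B, -1, -1)   # (B, 2500, 2)
        f = feats.unsqueeze(1).expand(-1, N, -1)      # (B, 2500, 512)
        return self.mlp(torch.cat([f, uv], dim=-1))   # (B, 2500, 3) point cloud
```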

Results

Below are test set examples showing input images, predicted parametric point clouds, and ground truth:

[Figure: Input | Predicted | Ground Truth]

Observations: The parametric network demonstrates effective learning of continuous 2D-to-3D surface mappings, combining the flexibility of point clouds with structured surface parameterization.

3.3 Extended Dataset for Training (10 points)

I trained the point cloud decoder on two datasets: (1) a single-class dataset with only chairs, and (2) an extended dataset with three classes (airplanes, cars, and chairs). Quantitatively, training on three classes achieves an F1@0.05 of 76.8% on chair test samples, compared to 79.9% when training on chairs alone, a 3.1 percentage-point decrease.

Qualitatively, the three-class model produces more diverse but occasionally less detailed reconstructions. The single-class model overfits to chair-specific features (legs, backrests, arm structures), while the three-class model learns more generalizable shape features applicable across object categories. This manifests as slightly reduced precision for fine chair details but improved robustness to unusual chair designs that deviate from the training distribution. The trade-off is a classic bias-variance balance: single-class training provides higher accuracy for the target category through specialization, while multi-class training offers better generalization and diversity at the cost of category-specific detail fidelity.

[Figure: Input RGB Image | Ground Truth Mesh | Predicted (Single Class) | Predicted (3 Classes)]