HW2 - Single View to 3D
1. Exploring Loss Functions
1.1 Voxel Loss
import torch.nn as nn

def voxel_loss(voxel_src, voxel_tgt):
    # binary cross-entropy between predicted occupancy logits and binary voxel targets
    criterion = nn.BCEWithLogitsLoss()
    loss = criterion(voxel_src, voxel_tgt)
    return loss
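As a quick sanity check on the expected inputs (the shapes here are assumptions): voxel_src should be raw, unnormalized logits, since BCEWithLogitsLoss applies the sigmoid internally, and voxel_tgt should be a float tensor of 0s and 1s.
import torch

logits = torch.randn(2, 32, 32, 32)                  # raw decoder outputs (no sigmoid applied)
targets = (torch.rand(2, 32, 32, 32) > 0.9).float()  # ~10% occupied, values in {0, 1}
print(voxel_loss(logits, targets))                   # scalar loss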
1.2 Point Cloud Loss
from pytorch3d.ops import knn_points

def chamfer_loss(point_cloud_src, point_cloud_tgt):
    # point_cloud_src, point_cloud_tgt: b x n_points x 3
    # for each point, find its nearest neighbor in the other cloud,
    # then average the distances in each direction and sum the two means
    # cast to float for the distance computation
    A = point_cloud_src.float()
    B = point_cloud_tgt.float()
    # knn_points with K=1 returns the squared nearest-neighbor distances; the indices are not needed
    dist_AB, _, _ = knn_points(A, B, K=1)
    dist_BA, _, _ = knn_points(B, A, K=1)
    loss_chamfer = dist_AB.mean() + dist_BA.mean()
    return loss_chamfer
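A minimal usage sketch with assumed shapes; note that because knn_points returns squared distances, this is the squared-distance form of the Chamfer loss.
import torch

src = torch.rand(2, 1000, 3)   # batch of 2 source clouds, 1000 points each
tgt = torch.rand(2, 1000, 3)
print(chamfer_loss(src, tgt))  # scalar: mean squared NN distance in both directions, summed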
1.3 Mesh Loss (Smoothness)
from pytorch3d.loss import mesh_laplacian_smoothing

def smoothness_loss(mesh_src):
    # uniform Laplacian smoothing regularizer on the predicted mesh
    loss_laplacian = mesh_laplacian_smoothing(mesh_src, method="uniform")
    return loss_laplacian
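For the mesh branch, this smoothness term is typically combined with the Chamfer loss on points sampled from the predicted and ground-truth meshes. A hedged sketch of that combination (mesh_tgt, the sample count, and the weight w_smooth are placeholders; sample_points_from_meshes comes from pytorch3d.ops):
from pytorch3d.ops import sample_points_from_meshes

# sampled points feed the Chamfer term; the Laplacian term regularizes the predicted mesh directly
pts_src = sample_points_from_meshes(mesh_src, num_samples=3000)  # (B, 3000, 3)
pts_tgt = sample_points_from_meshes(mesh_tgt, num_samples=3000)
loss = chamfer_loss(pts_src, pts_tgt) + w_smooth * smoothness_loss(mesh_src)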
2. Reconstructing 3D from Single View
2.1 Image to Voxel Grid
- Decoder design
Halving the number of channels while doubling the spatial resolution at each layer is a common decoder design, so I follow that strategy here; a ConvTranspose3d with kernel size 4, stride 2, and padding 1 exactly doubles each spatial dimension.
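Before the full decoder, a quick standalone check of that doubling claim (out = (in - 1)*stride - 2*padding + kernel = (in - 1)*2 - 2 + 4 = 2*in; imports shown for completeness):
import torch
import torch.nn as nn

layer = nn.ConvTranspose3d(64, 32, kernel_size=4, stride=2, padding=1)
print(layer(torch.zeros(1, 64, 2, 2, 2)).shape)  # torch.Size([1, 32, 4, 4, 4]): spatial size doubled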
# define decoder
if args.type == "vox":
    # Input: b x 512
    # Output: b x 32 x 32 x 32
    self.decoder = nn.Sequential(
        nn.Linear(512, 64 * 2 * 2 * 2),  # 512 -> 512 (rearrange features)
        nn.ReLU(),
        nn.Unflatten(1, (64, 2, 2, 2)),  # (B, 512) -> (B, 64, 2, 2, 2)
        nn.ConvTranspose3d(in_channels=64, out_channels=32, kernel_size=4, stride=2, padding=1),  # (B, 64, 2, 2, 2) -> (B, 32, 4, 4, 4)
        nn.ReLU(),
        nn.ConvTranspose3d(32, 16, kernel_size=4, stride=2, padding=1),  # (B, 32, 4, 4, 4) -> (B, 16, 8, 8, 8)
        nn.ReLU(),
        nn.ConvTranspose3d(16, 8, kernel_size=4, stride=2, padding=1),   # (B, 16, 8, 8, 8) -> (B, 8, 16, 16, 16)
        nn.ReLU(),
        nn.ConvTranspose3d(8, 1, kernel_size=4, stride=2, padding=1),    # (B, 8, 16, 16, 16) -> (B, 1, 32, 32, 32), occupancy logits
    )
# Forward: call the decoder on the encoded image feature
if args.type == "vox":
    voxels_pred = self.decoder(encoded_feat)  # (B, 1, 32, 32, 32) occupancy logits
    return voxels_pred
- Visualization (Input RGB, GT voxel, Predicted voxel)
2.2 Image to Point Cloud
- Decoder design (see the forward-pass sketch at the end of this subsection)
self.decoder = nn.Sequential(
    nn.Linear(512, 1024),
    nn.ReLU(),
    nn.Linear(1024, 2048),
    nn.ReLU(),
    nn.Linear(2048, args.n_points * 3),  # flat vector of xyz coordinates
)
- Training results

- Visualization (Input RGB, GT point cloud, Predicted point cloud)
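As referenced in the decoder design above, a minimal sketch of how the point branch's forward pass could turn the flat decoder output into a point set (this reshape is my assumption based on the decoder's output size, not the assignment's reference code):
if args.type == "point":
    B = encoded_feat.shape[0]
    # flat (B, n_points * 3) output reshaped into a point cloud
    pointclouds_pred = self.decoder(encoded_feat).reshape(B, args.n_points, 3)
    return pointclouds_pred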



2.3 Image to Mesh
- Decoder design (predicts per-vertex offsets from the ico_sphere template; see the forward-pass sketch below)
num_verts = mesh_pred.verts_packed().shape[0]
self.decoder = nn.Sequential(
    nn.Linear(512, 1024),
    nn.ReLU(),
    nn.Linear(1024, 2048),
    nn.ReLU(),
    nn.Linear(2048, num_verts * 3),  # flat vector of per-vertex offsets
)
- Training results
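As referenced above, a minimal sketch of applying the predicted offsets to the template mesh in the forward pass (storing the ico_sphere template on the model and using Meshes.offset_verts are my assumptions, consistent with the "deltas from a base ico_sphere" description in 2.4):
if args.type == "mesh":
    # self.mesh_pred: the batched ico_sphere template created at init (assumed)
    deform_vertices_pred = self.decoder(encoded_feat)                             # (B, num_verts * 3)
    mesh_pred = self.mesh_pred.offset_verts(deform_vertices_pred.reshape(-1, 3))  # displace template vertices
    return mesh_pred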


2.4 Quantitative Comparisons
- F1 score plots

- Explanation:
- Key Insight: The performance differences come down to how straightforward the relationship is between what the model predicts and what the loss function evaluates.
- Point Cloud (F1@0.05: 76.19%) - Most Direct
Point clouds have the most one-to-one relationship between prediction and loss. The model directly outputs (x,y,z) coordinates, and the loss directly measures point-to-point distances. There's no implicit-explicit gap - what you predict is exactly what you evaluate. This straightforward mapping makes it easiest for the model to learn.
- Mesh (F1@0.05: 69.40%) - Moderately Constrained
Meshes are still relatively straightforward since they predict vertex coordinates (deltas from a base ico_sphere), but they're more constrained. The model must maintain surface connectivity and starts from a fixed topology that doesn't necessarily fit all chair shapes well. Whether a chair has curved vs. straight legs, or arms vs. no arms, the template has to deform to match - and that depends heavily on the base shape and vertex density.
- Voxel (F1@0.05: 32.71%) - Most Indirect
Voxels perform worst because the prediction is fundamentally different from continuous 3D geometry: it is a discrete, binary problem where each voxel is either occupied or empty. As I found in the hyperparameter analysis, this makes the model vulnerable to class imbalance, either becoming over-confident that voxels are empty or over-predicting occupancy. The discrete formulation also gives the model a less informative gradient to learn from, and the 32³ resolution loses fine details that point clouds naturally capture. Voxels are additionally at a disadvantage under this evaluation: the F1 score is computed from point-to-point distances, which is essentially what the Chamfer loss optimizes for the point cloud and mesh models, whereas the voxel model's BCE loss never sees those distances at all.
2.5 Hyperparameter Variation
- Parameter Studied:
pos_weight in BCEWithLogitsLoss for voxel occupancy prediction
- Motivation: Voxel grids suffer from severe class imbalance - in a 32³ grid, only ~5-10% of voxels are occupied by the object while ~90% are empty. Without correction, the model can minimize loss by simply predicting empty voxels everywhere.
- Experimental Setup:
- Tested three configurations: no weighting (pos_weight=1.0), moderate weighting (pos_weight=7.0), and high weighting (pos_weight=30.0)
- Trained for 1-2 epochs on chair category from R2N2 dataset
- Visualized predictions using marching cubes
- Results:

- pos_weight = 1.0 (default, no weighting): nearly empty predictions with minimal occupancy
def voxel_loss(voxel_src, voxel_tgt):
    criterion = nn.BCEWithLogitsLoss()
    loss = criterion(voxel_src, voxel_tgt)
    return loss
- pos_weight = 7.0 (moderate weighting): clear chair structure emerges, balanced geometry
def voxel_loss(voxel_src, voxel_tgt):
    # weight positive (occupied) voxels to counter the class imbalance
    pos_weight = torch.tensor([7.0]).to(voxel_src.device)
    criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
    loss = criterion(voxel_src, voxel_tgt)
    return loss
- pos_weight = 30.0 (high weighting): over-prediction with thick, blob-like geometry
def voxel_loss(voxel_src, voxel_tgt):
    # weight positive (occupied) voxels to counter the class imbalance
    pos_weight = torch.tensor([30.0]).to(voxel_src.device)
    criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
    loss = criterion(voxel_src, voxel_tgt)
    return loss
- Conclusion: The optimal pos_weight should approximate the inverse class-frequency ratio (a rule of thumb I found through some ChatGPT research after the experiment). Setting pos_weight=7.0 effectively addresses the class imbalance, enabling the model to learn meaningful 3D structure within just 1-2 epochs. Values that are too low yield nearly empty predictions, while values that are too high produce excessive occupancy.
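A minimal sketch of estimating that inverse class-frequency ratio directly from a batch of ground-truth grids (the function name and the clamp are my additions):
import torch
import torch.nn as nn

def estimate_pos_weight(voxel_tgt):
    # voxel_tgt: (B, 32, 32, 32) binary ground-truth occupancy
    occupied = voxel_tgt.sum()
    empty = voxel_tgt.numel() - occupied
    return (empty / occupied.clamp(min=1)).reshape(1)  # e.g. ~9 when ~10% of voxels are occupied

# hypothetical usage:
# criterion = nn.BCEWithLogitsLoss(pos_weight=estimate_pos_weight(voxel_tgt))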
2.6 Model Interpretation (via Gradient-Based Saliency Maps)
# minimal implementation snippet
import matplotlib.pyplot as plt
images_gt.requires_grad = True
prediction = model(images_gt, args)
loss = calculate_loss(prediction, ground_truth)
loss.backward()
gradients = images_gt.grad.abs().mean(dim=-1) # Average across RGB
plt.imshow(gradients[0].cpu())
# referenced Simonyan, K., Vedaldi, A., & Zisserman, A. (2013). "Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps."
- To understand what features each model attends to during prediction, I visualized gradient-based saliency maps following the approach of Simonyan et al. (2013). By computing the absolute gradients of the loss with respect to the input pixels, I can identify which image regions most influence the model's predictions. In total, I computed saliency maps for 5 different images across the three models; summary statistics are given below.
Averages across 5 examples:
- Vox: loss 0.4773, mean saliency 0.0548, max activation 1.0000
- Point: loss 0.0052, mean saliency 0.0627, max activation 0.9997
- Mesh: loss 0.0074, mean saliency 0.0879, max activation 0.9999
- Observed Attention Patterns:
- The voxel model exhibits the most localized activation, focusing primarily on object edges and contours. While this edge-focused attention seems intuitive, the model struggles with thin structures like chair legs - it captures the solid seat area but misses finer geometric details. Interestingly, despite this focused attention, the voxel model achieves the poorest performance (loss: 0.29-0.70).

- The mesh model displays more spread-out activation with higher overall intensity. The saliency maps show a more uniform, grid-like pattern covering the entire chair structure, though some activation bleeds into the background regions. The mesh model's attention appears less fine-grained than the point model, which may relate to its template-based approach where vertex deformations require global shape understanding.

- The point cloud model shows the most diffuse attention pattern, distributed relatively evenly across the chair silhouette. Despite having lower mean saliency values (0.046-0.099) compared to mesh (0.054-0.124), the point model achieves the best performance (loss: 0.003-0.007). This suggests that for direct coordinate prediction, holistic shape information is more valuable than localized edge detection.

- Statistical Analysis:
- The quantitative statistics reveal a counterintuitive relationship: models with more localized, edge-focused attention perform worse. The voxel model has the lowest saliency mean (0.027-0.070) and most selective activation, yet the highest loss. Conversely, the point cloud model's broader, more diffuse attention correlates with superior reconstruction quality. This indicates that 3D reconstruction benefits from global context rather than purely local features. For volumetric occupancy prediction, focusing on edges alone cannot provide the spatial reasoning needed to fill a 32³ grid with correct binary decisions.

3. Exploring Other Architectures / Datasets
3.1 Implicit Network
- Implementation
# decoder design
elif args.type == "implicit":
    # Implicit decoder: takes image feature (512) + 3D coordinate (3) -> occupancy logit (1)
    self.decoder = nn.Sequential(
        nn.Linear(515, 256),
        nn.ReLU(),
        nn.Linear(256, 128),
        nn.ReLU(),
        nn.Linear(128, 64),
        nn.ReLU(),
        nn.Linear(64, 1)  # Output: single occupancy value per (feature, coordinate) pair
    )

# data loading during training: sample random points in the normalized space
# Sample N random coordinates per image
B = images.shape[0]
N = 1000  # Number of points to sample
coords = torch.rand(B, N, 3) * 2 - 1  # Random in [-1, 1]³
voxels = feed_dict["voxels"].float()
# Return a tuple as the 3D ground truth
ground_truth_3d = (voxels, coords)

# Special forward pass handling for implicit
voxels_gt, coords = ground_truth_3d
# Manual forward pass
B = images_gt.shape[0]
N = coords.shape[1]
# Encode images
images_normalize = model.normalize(images_gt.permute(0, 3, 1, 2))
encoded_feat = model.encoder(images_normalize).squeeze(-1).squeeze(-1)
# Expand the image feature to every query point and concatenate with the coordinates
features_expanded = encoded_feat.unsqueeze(1).expand(-1, N, -1)
decoder_input = torch.cat([features_expanded, coords], dim=-1)
decoder_input_flat = decoder_input.reshape(B * N, 515)
# Decoder forward
prediction_3d = model.decoder(decoder_input_flat).reshape(B, N, 1)
- Run 1 (5 epochs)

Final epoch loss: 0.12358080
Total epochs: 5
Total iterations: 3810
- Run 2 (23 epochs)

Minimum loss: 0.0634
Average loss: 0.1200
Total iterations: 17236
Total time: 29192.9 seconds
- Result 📌
The implicit network showed training instability, occasionally failing to predict occupied voxels (empty reconstructions). This occurs because implicit MLPs lack architectural constraints enforcing spatial coherence - they must learn 3D structure purely from data, unlike explicit decoders where the architecture provides strong inductive bias toward valid outputs. This resulted in lower F1-scores (43.79%) and occasional reconstruction failures during evaluation.
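One detail the snippets above omit is how the BCE targets at the randomly sampled coordinates are obtained from the 32³ ground-truth grid. A minimal sketch of one way to do this lookup (the use of F.grid_sample, the nearest-neighbor mode, and the coordinate ordering are my assumptions, not the reference implementation):
import torch
import torch.nn.functional as F

def sample_voxel_occupancy(voxels_gt, coords):
    # voxels_gt: (B, 32, 32, 32) binary grid; coords: (B, N, 3) queries in [-1, 1]³
    B, N, _ = coords.shape
    grid = coords.view(B, N, 1, 1, 3)  # grid_sample expects (B, D_out, H_out, W_out, 3)
    # grid_sample reads the last dim as (x, y, z) -> (W, H, D); depending on the dataset's
    # axis convention, the coordinates may need to be flipped with coords.flip(-1)
    occ = F.grid_sample(voxels_gt.unsqueeze(1), grid, mode="nearest", align_corners=True)
    return occ.view(B, N, 1)  # per-point occupancy targets in {0, 1}

# hypothetical usage alongside the forward pass above:
# targets = sample_voxel_occupancy(voxels_gt, coords)
# loss = nn.BCEWithLogitsLoss()(prediction_3d, targets)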


📌 Overall Performance Analysis Reflection

Point clouds (76.19%) and meshes (69.40%) significantly outperformed implicit (43.79%) and voxel (32.71%) representations.
Point clouds work best because they directly predict 3D coordinates - there's a straightforward path from the loss function to what needs to be learned. The network simply adjusts xyz values until points match the target shape.
Meshes perform well by deforming a spherical template, which provides built-in surface structure. However, forcing all chairs to deform from the same sphere limits flexibility - chairs with different topologies (arms vs. armless, curved vs. straight) all squeeze into the same template.
Implicit networks can query at any resolution and produce smooth surfaces, but training is fragile. The decoder has no built-in understanding of "nearby points should have similar occupancy" - it must learn this from scratch. This resulted in occasional complete failures where the network predicted nothing was occupied.
Voxels performed surprisingly poorly, likely due to minimal training (1-2 epochs) rather than architectural flaws. With proper training, voxels typically outperform implicit networks since the upsampling layers naturally enforce spatial structure.
The ranking suggests that directly predicting the output format (point coordinates, mesh vertices) is more effective than learning intermediate representations (occupancy functions, voxel grids) for this task and training budget.




