
```python
import torch
import torch.nn as nn


class DecoderBlock(nn.Module):
    """Doubles the spatial resolution of a 3D feature volume, with a residual skip."""

    def __init__(self, in_channels, out_channels, skip_channels=0):
        super().__init__()
        self.layers = nn.Sequential(
            # Stride-2 transposed conv doubles each spatial dimension.
            nn.ConvTranspose3d(in_channels + skip_channels, out_channels,
                               stride=2, kernel_size=4, padding=1),
            nn.BatchNorm3d(out_channels),
            nn.ReLU(inplace=True),
            # kernel_size=3 with padding=1 and stride=1 preserves the spatial size.
            nn.ConvTranspose3d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_channels),
        )
        self.relu = nn.ReLU(inplace=True)
        # Residual path: a strided transposed conv so the skip tensor matches
        # the main path in both resolution and channel count.
        self.skip = nn.ConvTranspose3d(in_channels, out_channels,
                                       stride=2, kernel_size=4, padding=1)

    def forward(self, x):
        skip = self.skip(x)
        x = self.layers(x)
        return self.relu(x + skip)
```
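Each block halves the channel count while doubling every spatial dimension, and the strided transposed convolution on the residual path keeps the two tensors shape-compatible for the final addition. A quick shape check (hypothetical usage, continuing from the definition above):

```python
block = DecoderBlock(512, 256)
x = torch.randn(2, 512, 8, 8, 8)  # b x 512 x 8 x 8 x 8 feature volume
print(block(x).shape)             # torch.Size([2, 256, 16, 16, 16])
```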
Voxel decoder:

```python
if args.type == "vox":
    # Input:  b x 512 image feature
    # Output: b x 1 x 32 x 32 x 32 occupancy grid
    self.projection = nn.Linear(512, 512 * 8 * 8 * 8)
    self.decoder = nn.Sequential(
        DecoderBlock(512, 256),  # 8^3 -> 16^3
        DecoderBlock(256, 128),  # 16^3 -> 32^3
        nn.Conv3d(128, 1, kernel_size=3, padding=1),
        nn.Sigmoid(),
    )
```
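Putting the head together end to end (a sketch of the forward pass, which isn't shown in the original; the `view` call that reshapes the projected feature into an 8^3 volume is an assumption):

```python
projection = nn.Linear(512, 512 * 8 * 8 * 8)
decoder = nn.Sequential(
    DecoderBlock(512, 256),
    DecoderBlock(256, 128),
    nn.Conv3d(128, 1, kernel_size=3, padding=1),
    nn.Sigmoid(),
)

feat = torch.randn(4, 512)                   # b x 512 encoder feature
x = projection(feat).view(-1, 512, 8, 8, 8)  # project, then reshape to an 8^3 volume
voxels = decoder(x)                          # 4 x 1 x 32 x 32 x 32 occupancies
```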


Point cloud decoder (b x 512 -> b x n_point x 3 after reshaping):

```python
self.decoder = nn.Sequential(
    nn.Linear(512, 1024),
    nn.ReLU(inplace=True),
    nn.Linear(1024, 2048),
    nn.ReLU(inplace=True),
    nn.Linear(2048, self.n_point * 3),
)
```
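The flat output is reshaped into a point set. A minimal usage sketch, assuming n_point = 1000 (the actual value is set elsewhere and not shown):

```python
import torch
import torch.nn as nn

n_point = 1000  # assumed for illustration
decoder = nn.Sequential(
    nn.Linear(512, 1024), nn.ReLU(inplace=True),
    nn.Linear(1024, 2048), nn.ReLU(inplace=True),
    nn.Linear(2048, n_point * 3),
)
feat = torch.randn(4, 512)
points = decoder(feat).reshape(-1, n_point, 3)  # b x n_point x 3
```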


Mesh decoder (b x 512 -> b x n_mesh_verts x 3 per-vertex predictions after reshaping):

```python
self.decoder = nn.Sequential(
    nn.Linear(512, 1024),
    nn.ReLU(inplace=True),
    nn.Linear(1024, 2048),
    nn.ReLU(inplace=True),
    nn.Linear(2048, self.n_mesh_verts * 3),
)
```
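A common way to consume this output, and an assumption here since the forward pass isn't shown, is as per-vertex offsets applied to a template mesh such as a PyTorch3D ico-sphere:

```python
import torch
import torch.nn as nn
from pytorch3d.utils import ico_sphere

b = 4
src_mesh = ico_sphere(4).extend(b)  # level-4 ico-sphere (2562 vertices), batched
n_mesh_verts = src_mesh.num_verts_per_mesh()[0].item()
decoder = nn.Sequential(
    nn.Linear(512, 1024), nn.ReLU(inplace=True),
    nn.Linear(1024, 2048), nn.ReLU(inplace=True),
    nn.Linear(2048, n_mesh_verts * 3),
)
feat = torch.randn(b, 512)
offsets = decoder(feat).reshape(-1, 3)      # packed (b * n_mesh_verts) x 3
pred_mesh = src_mesh.offset_verts(offsets)  # deform the template into the prediction
```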


Interpretation:

Interpretation:

Interpretation:
Parameter Studied: Voxel extraction threshold for marching cubes and cubify operations
Motivation:
The threshold parameter controls the isosurface value used when converting predicted voxel occupancy grids into explicit 3D meshes. This is a critical hyperparameter because it decides which predicted occupancy probabilities count as solid geometry: too low a threshold keeps uncertain voxels and inflates the shape, while too high a threshold erodes thin structures.
Experimental Setup:
The predicted voxel grids were converted to meshes at isosurface thresholds of 0.2, 0.3, and 0.5, and the resulting renders are compared below.
Results:
| Threshold 0.2 | Threshold 0.3 | Threshold 0.5 |
|---|---|---|
| ![]() | ![]() | ![]() |
Analysis:
I found tuning this hyperparameter to be the most interesting, as it reveals where the model is most confident in its predictions. The model assigns its highest occupancy probabilities to voxels in the center of the predicted shape, while voxels toward the boundary are a little 'fuzzier', so raising the threshold erodes the shape from the outside in.
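For reference, the threshold is applied at mesh-extraction time. A minimal sketch using PyTorch3D's cubify (the voxel tensor here is a stand-in for the model's predictions):

```python
import torch
from pytorch3d.ops import cubify

voxels = torch.rand(1, 32, 32, 32)  # stand-in for predicted occupancy probabilities
mesh = cubify(voxels, thresh=0.3)   # cells with probability above 0.3 become occupied cubes
```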
Method: L1 Distance Heatmap Overlay
I rendered the predicted 3D shape from the input camera viewpoint and computed pixel-wise L1 distances against the original image. The heatmap reveals where the model's reconstruction fails to match the 2D observation: red regions indicate geometric errors such as incorrect depth, missing structures, or hallucinated geometry. This method directly measures multi-view consistency by connecting 3D predictions back to the 2D input domain.
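A minimal sketch of the heatmap computation, assuming the predicted render and the input image are already available as H x W x 3 float arrays in [0, 1] (the rendering step itself is omitted):

```python
import numpy as np
import matplotlib.pyplot as plt

def l1_heatmap_overlay(rendered, target, alpha=0.5):
    """Blend a per-pixel L1 error heatmap over the input image."""
    err = np.abs(rendered - target).mean(axis=-1)  # H x W, averaged over RGB
    err = err / (err.max() + 1e-8)                 # normalize to [0, 1]
    heat = plt.cm.jet(err)[..., :3]                # high error -> red, low -> blue
    return (1 - alpha) * target + alpha * heat

# overlay = l1_heatmap_overlay(rendered_img, input_img)
# plt.imshow(overlay); plt.axis("off"); plt.show()
```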

Training

Results
