
```python
import torch
import torch.nn as nn


class DecoderBlock(nn.Module):
    """Doubles the spatial resolution of a 3D feature volume, with a residual skip."""

    def __init__(self, in_channels, out_channels, skip_channels=0):
        super().__init__()
        self.layers = nn.Sequential(
            # Stride-2 transposed conv doubles each spatial dimension.
            nn.ConvTranspose3d(in_channels + skip_channels, out_channels,
                               stride=2, kernel_size=4, padding=1),
            nn.BatchNorm3d(out_channels),
            nn.ReLU(inplace=True),
            # kernel_size=3 with padding=1 and stride=1 preserves the spatial size.
            nn.ConvTranspose3d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_channels),
        )
        self.relu = nn.ReLU(inplace=True)
        # Residual path: a strided transposed conv so the skip tensor matches
        # the main path in both resolution and channel count.
        self.skip = nn.ConvTranspose3d(in_channels, out_channels,
                                       stride=2, kernel_size=4, padding=1)

    def forward(self, x):
        skip = self.skip(x)
        x = self.layers(x)
        return self.relu(x + skip)
```
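Each block halves the channel count while doubling every spatial dimension, and the strided transposed convolution on the residual path keeps the two tensors shape-compatible for the final addition. A quick shape check (hypothetical usage, continuing from the definition above):

```python
block = DecoderBlock(512, 256)
x = torch.randn(2, 512, 8, 8, 8)  # b x 512 x 8 x 8 x 8 feature volume
print(block(x).shape)             # torch.Size([2, 256, 16, 16, 16])
```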
Voxel decoder:

```python
if args.type == "vox":
    # Input:  b x 512 image feature
    # Output: b x 1 x 32 x 32 x 32 occupancy grid
    self.projection = nn.Linear(512, 512 * 8 * 8 * 8)
    self.decoder = nn.Sequential(
        DecoderBlock(512, 256),  # 8^3 -> 16^3
        DecoderBlock(256, 128),  # 16^3 -> 32^3
        nn.Conv3d(128, 1, kernel_size=3, padding=1),
        nn.Sigmoid(),
    )
```
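Putting the head together end to end (a sketch of the forward pass, which isn't shown in the original; the `view` call that reshapes the projected feature into an 8^3 volume is an assumption):

```python
projection = nn.Linear(512, 512 * 8 * 8 * 8)
decoder = nn.Sequential(
    DecoderBlock(512, 256),
    DecoderBlock(256, 128),
    nn.Conv3d(128, 1, kernel_size=3, padding=1),
    nn.Sigmoid(),
)

feat = torch.randn(4, 512)                   # b x 512 encoder feature
x = projection(feat).view(-1, 512, 8, 8, 8)  # project, then reshape to an 8^3 volume
voxels = decoder(x)                          # 4 x 1 x 32 x 32 x 32 occupancies
```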


Point cloud decoder (b x 512 -> b x n_point x 3 after reshaping):

```python
self.decoder = nn.Sequential(
    nn.Linear(512, 1024),
    nn.ReLU(inplace=True),
    nn.Linear(1024, 2048),
    nn.ReLU(inplace=True),
    nn.Linear(2048, self.n_point * 3),
)
```
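The flat output is reshaped into a point set. A minimal usage sketch, assuming n_point = 1000 (the actual value is set elsewhere and not shown):

```python
import torch
import torch.nn as nn

n_point = 1000  # assumed for illustration
decoder = nn.Sequential(
    nn.Linear(512, 1024), nn.ReLU(inplace=True),
    nn.Linear(1024, 2048), nn.ReLU(inplace=True),
    nn.Linear(2048, n_point * 3),
)
feat = torch.randn(4, 512)
points = decoder(feat).reshape(-1, n_point, 3)  # b x n_point x 3
```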


Mesh decoder (b x 512 -> b x n_mesh_verts x 3 per-vertex predictions after reshaping):

```python
self.decoder = nn.Sequential(
    nn.Linear(512, 1024),
    nn.ReLU(inplace=True),
    nn.Linear(1024, 2048),
    nn.ReLU(inplace=True),
    nn.Linear(2048, self.n_mesh_verts * 3),
)
```
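A common way to consume this output, and an assumption here since the forward pass isn't shown, is as per-vertex offsets applied to a template mesh such as a PyTorch3D ico-sphere:

```python
import torch
import torch.nn as nn
from pytorch3d.utils import ico_sphere

b = 4
src_mesh = ico_sphere(4).extend(b)  # level-4 ico-sphere (2562 vertices), batched
n_mesh_verts = src_mesh.num_verts_per_mesh()[0].item()
decoder = nn.Sequential(
    nn.Linear(512, 1024), nn.ReLU(inplace=True),
    nn.Linear(1024, 2048), nn.ReLU(inplace=True),
    nn.Linear(2048, n_mesh_verts * 3),
)
feat = torch.randn(b, 512)
offsets = decoder(feat).reshape(-1, 3)      # packed (b * n_mesh_verts) x 3
pred_mesh = src_mesh.offset_verts(offsets)  # deform the template into the prediction
```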


Interpretation:

Interpretation:

Interpretation:
Parameter Studied: Voxel extraction threshold for marching cubes and cubify operations
Motivation:
The threshold parameter controls the isosurface value used when converting predicted voxel occupancy grids into explicit 3D meshes. This is a critical hyperparameter because it decides which predicted occupancy probabilities count as solid geometry: too low a threshold keeps uncertain voxels and inflates the shape, while too high a threshold erodes thin structures.
Experimental Setup:
The predicted voxel grids were converted to meshes at isosurface thresholds of 0.2, 0.3, and 0.5, and the resulting renders are compared below.
Results:
| Threshold 0.2 | Threshold 0.3 | Threshold 0.5 |
|---|---|---|
| ![]() | ![]() | ![]() |
Analysis:
I found tuning this hyperparameter to be the most interesting, as it reveals where the model is most confident in its predictions. The model assigns its highest occupancy probabilities to voxels in the center of the predicted shape, while voxels toward the boundary are a little 'fuzzier', so raising the threshold erodes the shape from the outside in.
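For reference, the threshold is applied at mesh-extraction time. A minimal sketch using PyTorch3D's cubify (the voxel tensor here is a stand-in for the model's predictions):

```python
import torch
from pytorch3d.ops import cubify

voxels = torch.rand(1, 32, 32, 32)  # stand-in for predicted occupancy probabilities
mesh = cubify(voxels, thresh=0.3)   # cells with probability above 0.3 become occupied cubes
```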
Method: L1 Distance Heatmap Overlay
I rendered the predicted 3D shape from the input camera viewpoint and computed pixel-wise L1 distances against the original image. The heatmap reveals where the model's reconstruction fails to match the 2D observation: red regions indicate geometric errors such as incorrect depth, missing structures, or hallucinated geometry. This method directly measures multi-view consistency by connecting 3D predictions back to the 2D input domain.
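A minimal sketch of the heatmap computation, assuming the predicted render and the input image are already available as H x W x 3 float arrays in [0, 1] (the rendering step itself is omitted):

```python
import numpy as np
import matplotlib.pyplot as plt

def l1_heatmap_overlay(rendered, target, alpha=0.5):
    """Blend a per-pixel L1 error heatmap over the input image."""
    err = np.abs(rendered - target).mean(axis=-1)  # H x W, averaged over RGB
    err = err / (err.max() + 1e-8)                 # normalize to [0, 1]
    heat = plt.cm.jet(err)[..., :3]                # high error -> red, low -> blue
    return (1 - alpha) * target + alpha * heat

# overlay = l1_heatmap_overlay(rendered_img, input_img)
# plt.imshow(overlay); plt.axis("off"); plt.show()
```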

Training

Results
