16825 HW2 - Yuchen Zhang

from IPython.display import Image
from matplotlib import pyplot as plt

Q1 - Fitting Voxel Grids, Point Clouds, and Meshes (Left: Prediction, Right: GT)

from IPython.display import HTML

HTML('''
<table style="width:80%; text-align:center; margin:auto;">
  <tr>
    <th></th>
    <th>Prediction</th>
    <th>Ground Truth</th>
  </tr>
  <tr>
    <th>Voxel</th>
    <td><img src="outputs/Q1.1_voxel_pred.gif" width="200"></td>
    <td><img src="outputs/Q1.1_voxel_tgt.gif" width="200"></td>
  </tr>
  <tr>
    <th>Point Cloud</th>
    <td><img src="outputs/Q1.2_point_pred.gif" width="200"></td>
    <td><img src="outputs/Q1.2_point_tgt.gif" width="200"></td>
  </tr>
  <tr>
    <th>Mesh</th>
    <td><img src="outputs/Q1.3_mesh_pred.gif" width="200"></td>
    <td><img src="outputs/Q1.3_mesh_tgt.gif" width="200"></td>
  </tr>
</table>
''')

Q2 - Reconstructing 3D from a Single View

Voxel

HTML('''
<table style="width:80%; text-align:center; margin:auto;">
  <tr>
    <th>Input</th>
    <th>Prediction</th>
    <th>Ground Truth</th>
  </tr>
  <tr>
    <td><img src="outputs/Q2.1/vox_0_input.png" width="200"></td>
    <td><img src="outputs/Q2.1/vox_0_pred.gif" width="200"></td>
    <td><img src="outputs/Q2.1/vox_0_gt.gif" width="200"></td>
  </tr>
  <tr>
    <td><img src="outputs/Q2.1/vox_100_input.png" width="200"></td>
    <td><img src="outputs/Q2.1/vox_100_pred.gif" width="200"></td>
    <td><img src="outputs/Q2.1/vox_100_gt.gif" width="200"></td>
  </tr>
  <tr>
    <td><img src="outputs/Q2.1/vox_200_input.png" width="200"></td>
    <td><img src="outputs/Q2.1/vox_200_pred.gif" width="200"></td>
    <td><img src="outputs/Q2.1/vox_200_gt.gif" width="200"></td>
  </tr>
</table>
''')

Points

HTML('''
<table style="width:80%; text-align:center; margin:auto;">
  <tr>
    <th>Input</th>
    <th>Prediction</th>
    <th>Ground Truth</th>
  </tr>
  <tr>
    <td><img src="outputs/Q2.2/point_0_input.png" width="200"></td>
    <td><img src="outputs/Q2.2/point_0_pred.gif" width="200"></td>
    <td><img src="outputs/Q2.2/point_0_gt.gif" width="200"></td>
  </tr>
  <tr>
    <td><img src="outputs/Q2.2/point_100_input.png" width="200"></td>
    <td><img src="outputs/Q2.2/point_100_pred.gif" width="200"></td>
    <td><img src="outputs/Q2.2/point_100_gt.gif" width="200"></td>
  </tr>
  <tr>
    <td><img src="outputs/Q2.2/point_200_input.png" width="200"></td>
    <td><img src="outputs/Q2.2/point_200_pred.gif" width="200"></td>
    <td><img src="outputs/Q2.2/point_200_gt.gif" width="200"></td>
  </tr>
</table>
''')

Mesh

HTML('''
<table style="width:80%; text-align:center; margin:auto;">
  <tr>
    <th>Input</th>
    <th>Prediction</th>
    <th>Ground Truth</th>
  </tr>
  <tr>
    <td><img src="outputs/Q2.3_l4/mesh_0_input.png" width="200"></td>
    <td><img src="outputs/Q2.3_l4/mesh_0_pred.gif" width="200"></td>
    <td><img src="outputs/Q2.3_l4/mesh_0_gt.gif" width="200"></td>
  </tr>
  <tr>
    <td><img src="outputs/Q2.3_l4/mesh_100_input.png" width="200"></td>
    <td><img src="outputs/Q2.3_l4/mesh_100_pred.gif" width="200"></td>
    <td><img src="outputs/Q2.3_l4/mesh_100_gt.gif" width="200"></td>
  </tr>
  <tr>
    <td><img src="outputs/Q2.3_l4/mesh_200_input.png" width="200"></td>
    <td><img src="outputs/Q2.3_l4/mesh_200_pred.gif" width="200"></td>
    <td><img src="outputs/Q2.3_l4/mesh_200_gt.gif" width="200"></td>
  </tr>
</table>
''')

Quantitative comparisons

Based on the results, the point cloud representation achieves the highest F1 score at the largest threshold (a sketch of the metric follows this list). I believe this advantage comes from two properties:

  1. Points are not fixed to a grid like voxels, so sub-voxel locations can be represented accurately.
  2. Points do not need to maintain a particular connectivity or ordering like mesh vertices, giving the network greater freedom.
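
A minimal sketch of the F1@threshold metric on sampled surface points, assuming pytorch3d's knn_points for nearest-neighbor distances; the helper name compute_f1 is mine and may differ from the starter code.

import torch
from pytorch3d.ops import knn_points

def compute_f1(pred_points, gt_points, threshold=0.05):
    # pred_points, gt_points: (1, N, 3) tensors of points sampled from each surface.
    # knn_points returns squared distances, so take the square root.
    d_pred_to_gt = knn_points(pred_points, gt_points, K=1).dists[..., 0].sqrt()
    d_gt_to_pred = knn_points(gt_points, pred_points, K=1).dists[..., 0].sqrt()
    precision = (d_pred_to_gt < threshold).float().mean()  # predicted points close to GT
    recall = (d_gt_to_pred < threshold).float().mean()      # GT points covered by prediction
    return (2 * precision * recall / (precision + recall + 1e-8)).item()
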
HTML('''
<img src="outputs/eval_vox.png" width="400">
<img src="outputs/eval_point.png" width="400">
<img src="outputs/eval_mesh.png" width="400">
''')

2.5 Analysing the Effects of Hyperparameter Variations

I initialized the source mesh with an icosphere subdivided 4 times (the default) and 6 times. The qualitative visualizations show that the finer initialization lets the network output more detailed predictions. This is reasonable because some curved surfaces cannot be well approximated by flat faces, which causes a large loss when the total number of faces is limited.
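
As a reference for the two initializations, a minimal sketch using pytorch3d's ico_sphere; the exact construction in the starter code may differ.

from pytorch3d.utils import ico_sphere

# Icosphere source meshes at the two subdivision levels compared above.
mesh_l4 = ico_sphere(4)  # 2562 vertices, 5120 faces (default)
mesh_l6 = ico_sphere(6)  # 40962 vertices, 81920 faces
print(mesh_l4.verts_packed().shape, mesh_l6.verts_packed().shape)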

HTML('''
<table style="width:80%; text-align:center; margin:auto;">
  <tr>
    <th>Prediction - Sub-divide level 4</th>
    <th>Prediction - Sub-divide level 6</th>
    <th>Ground Truth</th>
  </tr>
  <tr>
    <td><img src="outputs/Q2.3_l4/mesh_0_pred.gif" width="200"></td>
    <td><img src="outputs/Q2.3_l6/mesh_0_pred.gif" width="200"></td>
    <td><img src="outputs/Q2.3_l4/mesh_0_gt.gif" width="200"></td>
  </tr>
  <tr>
    <td><img src="outputs/Q2.3_l4/mesh_100_pred.gif" width="200"></td>
    <td><img src="outputs/Q2.3_l6/mesh_100_pred.gif" width="200"></td>
    <td><img src="outputs/Q2.3_l4/mesh_100_gt.gif" width="200"></td>
  </tr>
  <tr>
    <td><img src="outputs/Q2.3_l4/mesh_200_pred.gif" width="200"></td>
    <td><img src="outputs/Q2.3_l6/mesh_200_pred.gif" width="200"></td>
    <td><img src="outputs/Q2.3_l4/mesh_200_gt.gif" width="200"></td>
  </tr>
</table>
''')

2.6 Visualization Inside the Network

I used a DPT-like architecture for the voxel network, so its intermediate features keep a 2D spatial arrangement. I used PCA to visualize the features at each upsampling level: both the input features flowing into a level and the refined features it produces.
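
For reference, a minimal sketch of the PCA projection used for these visualizations; the helper name feature_map_to_rgb and the assumption that a feature map arrives as a (C, H, W) NumPy array are mine.

import numpy as np
from sklearn.decomposition import PCA

def feature_map_to_rgb(feat):
    # feat: (C, H, W) feature map taken from one upsampling level of the network.
    C, H, W = feat.shape
    flat = feat.reshape(C, -1).T                   # (H*W, C): one row per spatial location
    rgb = PCA(n_components=3).fit_transform(flat)  # project channels onto 3 components
    rgb = (rgb - rgb.min(0)) / (rgb.max(0) - rgb.min(0) + 1e-8)  # scale to [0, 1]
    return rgb.reshape(H, W, 3)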

HTML('''
<table style="width:80%; text-align:center; margin:auto;">
  <tr>
    <th>input</th>
    <th>type</th>
    <th>level 5x5</th>
    <th>level 8x8</th>
    <th>level 16x16</th>
  </tr>
  <tr>
    <td><img src="outputs/Q2.1/vox_0_input.png" width="200"></td>
    <th>Input</th>
    <td><img src="outputs/Q2.1/vox_0_feat_input_0_pca.png" width="200"></td>
    <td><img src="outputs/Q2.1/vox_0_feat_input_1_pca.png" width="200"></td>
    <td><img src="outputs/Q2.1/vox_0_feat_input_2_pca.png" width="200"></td>
  </tr>
  <tr>
    <th></th>
    <th>Refined</th>
    <td><img src="outputs/Q2.1/vox_0_feat_refined_0_pca.png" width="200"></td>
    <td><img src="outputs/Q2.1/vox_0_feat_refined_1_pca.png" width="200"></td>
    <td><img src="outputs/Q2.1/vox_0_feat_refined_2_pca.png" width="200"></td>
  </tr>
<tr>
    <td><img src="outputs/Q2.1/vox_100_input.png" width="200"></td>
    <th>Input</th>
    <td><img src="outputs/Q2.1/vox_100_feat_input_0_pca.png" width="200"></td>
    <td><img src="outputs/Q2.1/vox_100_feat_input_1_pca.png" width="200"></td>
    <td><img src="outputs/Q2.1/vox_100_feat_input_2_pca.png" width="200"></td>
  </tr>
  <tr>
    <th></th>
    <th>Refined</th>
    <td><img src="outputs/Q2.1/vox_100_feat_refined_0_pca.png" width="200"></td>
    <td><img src="outputs/Q2.1/vox_100_feat_refined_1_pca.png" width="200"></td>
    <td><img src="outputs/Q2.1/vox_100_feat_refined_2_pca.png" width="200"></td>
  </tr>

<tr>
    <td><img src="outputs/Q2.1/vox_200_input.png" width="200"></td>
    <th>Input</th>
    <td><img src="outputs/Q2.1/vox_200_feat_input_0_pca.png" width="200"></td>
    <td><img src="outputs/Q2.1/vox_200_feat_input_1_pca.png" width="200"></td>
    <td><img src="outputs/Q2.1/vox_200_feat_input_2_pca.png" width="200"></td>
  </tr>
  <tr>
    <th></th>
    <th>Refined</th>
    <td><img src="outputs/Q2.1/vox_200_feat_refined_0_pca.png" width="200"></td>
    <td><img src="outputs/Q2.1/vox_200_feat_refined_1_pca.png" width="200"></td>
    <td><img src="outputs/Q2.1/vox_200_feat_refined_2_pca.png" width="200"></td>
  </tr>
</table>
''')

From the visualization, we can see a grid-like pattern in the final input features (level 16x16) used to refine the predictions across all examples. This suggests that edge features may be encoded in that layer, which the DPT head uses to produce the final result.

Features from the first few layers have too low a spatial resolution and are hard to interpret through PCA.

Q3: Exploring other Datasets

I trained the point cloud architecture on the full dataset and evaluated it against the checkpoint trained only on the chair class.

HTML('''
<table style="width:80%; text-align:center; margin:auto;">
  <tr>
    <th>Input</th>
    <th>Prediction - 1 Class</th>
    <th>Prediction - 3 Class</th>
    <th>Ground Truth</th>
  </tr>
  <tr>
    <td><img src="outputs/Q2.2/point_0_input.png" width="200"></td>
    <td><img src="outputs/Q2.2/point_0_pred.gif" width="200"></td>
    <td><img src="outputs/Q2.2_full/point_0_pred.gif" width="200"></td>
    <td><img src="outputs/Q2.2/point_0_gt.gif" width="200"></td>
  </tr>
  <tr>
    <td><img src="outputs/Q2.2/point_100_input.png" width="200"></td>
    <td><img src="outputs/Q2.2/point_100_pred.gif" width="200"></td>
    <td><img src="outputs/Q2.2_full/point_100_pred.gif" width="200"></td>
    <td><img src="outputs/Q2.2/point_100_gt.gif" width="200"></td>
  </tr>
  <tr>
    <td><img src="outputs/Q2.2/point_200_input.png" width="200"></td>
    <td><img src="outputs/Q2.2/point_200_pred.gif" width="200"></td>
    <td><img src="outputs/Q2.2_full/point_200_pred.gif" width="200"></td>
    <td><img src="outputs/Q2.2/point_200_gt.gif" width="200"></td>
  </tr>
</table>
''')

Qualitatively, the model trained on more classes still predicts the shape of the chairs successfully; there is no significant qualitative difference between the two.

HTML('''
<table style="width:80%; text-align:center; margin:auto;">
  <tr>
    <th>1 class</th>
    <th>3 class</th>
  </tr>
  <tr><td>
<img src="outputs/eval_point.png" width="400"></td><td>
<img src="outputs/eval_point_full.png" width="400"></td>
  </tr>
</table>
''')

Quantitatively, the model trained on 1 class performs better on that class than the model trained on 3 classes. I conclude that the model might be too small to absorb the data from all classes, leading to sub-optimal performance on the chair class.