from IPython.display import Image, HTML
from matplotlib import pyplot as plt
HTML('''
<table style="width:80%; text-align:center; margin:auto;">
<tr>
<th></th>
<th>Prediction</th>
<th>Ground Truth</th>
</tr>
<tr>
<th>Voxel</th>
<td><img src="outputs/Q1.1_voxel_pred.gif" width="200"></td>
<td><img src="outputs/Q1.1_voxel_tgt.gif" width="200"></td>
</tr>
<tr>
<th>Point Cloud</th>
<td><img src="outputs/Q1.2_point_pred.gif" width="200"></td>
<td><img src="outputs/Q1.2_point_tgt.gif" width="200"></td>
</tr>
<tr>
<th>Mesh</th>
<td><img src="outputs/Q1.3_mesh_pred.gif" width="200"></td>
<td><img src="outputs/Q1.3_mesh_tgt.gif" width="200"></td>
</tr>
</table>
''')
HTML('''
<table style="width:80%; text-align:center; margin:auto;">
<tr>
<th>Input</th>
<th>Prediction</th>
<th>Ground Truth</th>
</tr>
<tr>
<td><img src="outputs/Q2.1/vox_0_input.png" width="200"></td>
<td><img src="outputs/Q2.1/vox_0_pred.gif" width="200"></td>
<td><img src="outputs/Q2.1/vox_0_gt.gif" width="200"></td>
</tr>
<tr>
<td><img src="outputs/Q2.1/vox_100_input.png" width="200"></td>
<td><img src="outputs/Q2.1/vox_100_pred.gif" width="200"></td>
<td><img src="outputs/Q2.1/vox_100_gt.gif" width="200"></td>
</tr>
<tr>
<td><img src="outputs/Q2.1/vox_200_input.png" width="200"></td>
<td><img src="outputs/Q2.1/vox_200_pred.gif" width="200"></td>
<td><img src="outputs/Q2.1/vox_200_gt.gif" width="200"></td>
</tr>
</table>
''')
HTML('''
<table style="width:80%; text-align:center; margin:auto;">
<tr>
<th>Input</th>
<th>Prediction</th>
<th>Ground Truth</th>
</tr>
<tr>
<td><img src="outputs/Q2.2/point_0_input.png" width="200"></td>
<td><img src="outputs/Q2.2/point_0_pred.gif" width="200"></td>
<td><img src="outputs/Q2.2/point_0_gt.gif" width="200"></td>
</tr>
<tr>
<td><img src="outputs/Q2.2/point_100_input.png" width="200"></td>
<td><img src="outputs/Q2.2/point_100_pred.gif" width="200"></td>
<td><img src="outputs/Q2.2/point_100_gt.gif" width="200"></td>
</tr>
<tr>
<td><img src="outputs/Q2.2/point_200_input.png" width="200"></td>
<td><img src="outputs/Q2.2/point_200_pred.gif" width="200"></td>
<td><img src="outputs/Q2.2/point_200_gt.gif" width="200"></td>
</tr>
</table>
''')
HTML('''
<table style="width:80%; text-align:center; margin:auto;">
<tr>
<th>Input</th>
<th>Prediction</th>
<th>Ground Truth</th>
</tr>
<tr>
<td><img src="outputs/Q2.3_l4/mesh_0_input.png" width="200"></td>
<td><img src="outputs/Q2.3_l4/mesh_0_pred.gif" width="200"></td>
<td><img src="outputs/Q2.3_l4/mesh_0_gt.gif" width="200"></td>
</tr>
<tr>
<td><img src="outputs/Q2.3_l4/mesh_100_input.png" width="200"></td>
<td><img src="outputs/Q2.3_l4/mesh_100_pred.gif" width="200"></td>
<td><img src="outputs/Q2.3_l4/mesh_100_gt.gif" width="200"></td>
</tr>
<tr>
<td><img src="outputs/Q2.3_l4/mesh_200_input.png" width="200"></td>
<td><img src="outputs/Q2.3_l4/mesh_200_pred.gif" width="200"></td>
<td><img src="outputs/Q2.3_l4/mesh_200_gt.gif" width="200"></td>
</tr>
</table>
''')
Based on the results, the point cloud representation has the highest F1 score at the largest threshold. I believe this advantage is based on:
HTML('''
<img src="outputs/eval_vox.png" width="400">
<img src="outputs/eval_point.png" width="400">
<img src="outputs/eval_mesh.png" width="400">
''')
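For reference, the F1 curves above are computed between point sets sampled from the prediction and the ground truth. Below is a minimal sketch of F1 at a single distance threshold, assuming (N, 3) and (M, 3) PyTorch tensors of sampled points; the function name f1_at_threshold is my own, not from the starter code.
import torch

def f1_at_threshold(pred_pts, gt_pts, threshold=0.05):
    # pred_pts: (N, 3) points sampled from the prediction; gt_pts: (M, 3) from the ground truth
    dists = torch.cdist(pred_pts, gt_pts)                             # (N, M) pairwise distances
    precision = (dists.min(dim=1).values < threshold).float().mean()  # predicted points near some GT point
    recall = (dists.min(dim=0).values < threshold).float().mean()     # GT points near some predicted point
    return 2 * precision * recall / (precision + recall + 1e-8)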
I tried initializing the mesh with a source mesh subdivided 4 times (the default) and 6 times. Qualitative visualizations show that a finer mesh initialization lets the network output more detailed predictions. This is reasonable because some surfaces cannot be approximated well by flat faces, which causes a large loss when the total number of faces is limited.
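The exact initialization depends on the starter code; assuming the source mesh comes from PyTorch3D's ico_sphere utility, switching the subdivision level is a one-line change (a sketch, not the actual training script):
import torch
from pytorch3d.utils import ico_sphere

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# A finer subdivision gives the deformation network many more vertices and faces to move:
# level 4 -> 2562 vertices / 5120 faces, level 6 -> 40962 vertices / 81920 faces.
src_mesh_l4 = ico_sphere(4, device)  # default initialization
src_mesh_l6 = ico_sphere(6, device)  # finer initialization used for the comparison below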
HTML('''
<table style="width:80%; text-align:center; margin:auto;">
<tr>
<th>Prediction - Subdivision level 4</th>
<th>Prediction - Subdivision level 6</th>
<th>Ground Truth</th>
</tr>
<tr>
<td><img src="outputs/Q2.3_l4/mesh_0_pred.gif" width="200"></td>
<td><img src="outputs/Q2.3_l6/mesh_0_pred.gif" width="200"></td>
<td><img src="outputs/Q2.3_l4/mesh_0_gt.gif" width="200"></td>
</tr>
<tr>
<td><img src="outputs/Q2.3_l4/mesh_100_pred.gif" width="200"></td>
<td><img src="outputs/Q2.3_l6/mesh_100_pred.gif" width="200"></td>
<td><img src="outputs/Q2.3_l4/mesh_100_gt.gif" width="200"></td>
</tr>
<tr>
<td><img src="outputs/Q2.3_l4/mesh_200_pred.gif" width="200"></td>
<td><img src="outputs/Q2.3_l6/mesh_200_pred.gif" width="200"></td>
<td><img src="outputs/Q2.3_l4/mesh_200_gt.gif" width="200"></td>
</tr>
</table>
''')
I used a DPT-like architecture in the voxel network, so the intermediate features are tied to a 2D spatial arrangement. I used PCA to visualize the features at each upsampling level, along with the new information that flows into that level.
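A minimal sketch of the PCA projection used for these visualizations, assuming a (C, H, W) feature map as a PyTorch tensor; save_feature_pca is a hypothetical helper, not part of the starter code.
import torch
from matplotlib import pyplot as plt
from sklearn.decomposition import PCA

def save_feature_pca(feat, out_path):
    # feat: (C, H, W) feature map; project each spatial location onto the top 3 PCA components
    c, h, w = feat.shape
    flat = feat.detach().cpu().permute(1, 2, 0).reshape(-1, c).numpy()     # (H*W, C)
    comps = PCA(n_components=3).fit_transform(flat)                        # (H*W, 3)
    comps = (comps - comps.min(0)) / (comps.max(0) - comps.min(0) + 1e-8)  # normalize to [0, 1]
    plt.imsave(out_path, comps.reshape(h, w, 3))                           # interpret components as RGB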
HTML('''
<table style="width:80%; text-align:center; margin:auto;">
<tr>
<th>Input</th>
<th>Type</th>
<th>level 5x5</th>
<th>level 8x8</th>
<th>level 16x16</th>
</tr>
<tr>
<td><img src="outputs/Q2.1/vox_0_input.png" width="200"></td>
<th>Input</th>
<td><img src="outputs/Q2.1/vox_0_feat_input_0_pca.png" width="200"></td>
<td><img src="outputs/Q2.1/vox_0_feat_input_1_pca.png" width="200"></td>
<td><img src="outputs/Q2.1/vox_0_feat_input_2_pca.png" width="200"></td>
</tr>
<tr>
<th></th>
<th>Refined</th>
<td><img src="outputs/Q2.1/vox_0_feat_refined_0_pca.png" width="200"></td>
<td><img src="outputs/Q2.1/vox_0_feat_refined_1_pca.png" width="200"></td>
<td><img src="outputs/Q2.1/vox_0_feat_refined_2_pca.png" width="200"></td>
</tr>
<tr>
<td><img src="outputs/Q2.1/vox_100_input.png" width="200"></td>
<th>Input</th>
<td><img src="outputs/Q2.1/vox_100_feat_input_0_pca.png" width="200"></td>
<td><img src="outputs/Q2.1/vox_100_feat_input_1_pca.png" width="200"></td>
<td><img src="outputs/Q2.1/vox_100_feat_input_2_pca.png" width="200"></td>
</tr>
<tr>
<th></th>
<th>Refined</th>
<td><img src="outputs/Q2.1/vox_100_feat_refined_0_pca.png" width="200"></td>
<td><img src="outputs/Q2.1/vox_100_feat_refined_1_pca.png" width="200"></td>
<td><img src="outputs/Q2.1/vox_100_feat_refined_2_pca.png" width="200"></td>
</tr>
<tr>
<td><img src="outputs/Q2.1/vox_200_input.png" width="200"></td>
<th>Input</th>
<td><img src="outputs/Q2.1/vox_200_feat_input_0_pca.png" width="200"></td>
<td><img src="outputs/Q2.1/vox_200_feat_input_1_pca.png" width="200"></td>
<td><img src="outputs/Q2.1/vox_200_feat_input_2_pca.png" width="200"></td>
</tr>
<tr>
<th></th>
<th>Refined</th>
<td><img src="outputs/Q2.1/vox_200_feat_refined_0_pca.png" width="200"></td>
<td><img src="outputs/Q2.1/vox_200_feat_refined_1_pca.png" width="200"></td>
<td><img src="outputs/Q2.1/vox_200_feat_refined_2_pca.png" width="200"></td>
</tr>
</table>
''')
From the visualization, we can see a grid-like pattern in the final input features (level 16x16) used to refine the DPT predictions across all examples. This suggests that edge features may be encoded in that layer, which the network uses to produce the final result.
Features from the first few layers have too low a resolution and are hard to interpret through PCA.
I trained the point cloud architecture on the full dataset and evaluated it against the checkpoint trained only on the chair class.
HTML('''
<table style="width:80%; text-align:center; margin:auto;">
<tr>
<th>Input</th>
<th>Prediction - 1 Class</th>
<th>Prediction - 3 Classes</th>
<th>Ground Truth</th>
</tr>
<tr>
<td><img src="outputs/Q2.2/point_0_input.png" width="200"></td>
<td><img src="outputs/Q2.2/point_0_pred.gif" width="200"></td>
<td><img src="outputs/Q2.2_full/point_0_pred.gif" width="200"></td>
<td><img src="outputs/Q2.2/point_0_gt.gif" width="200"></td>
</tr>
<tr>
<td><img src="outputs/Q2.2/point_100_input.png" width="200"></td>
<td><img src="outputs/Q2.2/point_100_pred.gif" width="200"></td>
<td><img src="outputs/Q2.2_full/point_100_pred.gif" width="200"></td>
<td><img src="outputs/Q2.2/point_100_gt.gif" width="200"></td>
</tr>
<tr>
<td><img src="outputs/Q2.2/point_200_input.png" width="200"></td>
<td><img src="outputs/Q2.2/point_200_pred.gif" width="200"></td>
<td><img src="outputs/Q2.2_full/point_200_pred.gif" width="200"></td>
<td><img src="outputs/Q2.2/point_200_gt.gif" width="200"></td>
</tr>
</table>
''')
Qualitatively, the model trained on more classes still predicts the shape of the chairs successfully; there is no significant difference between the two models.
HTML('''
<table style="width:80%; text-align:center; margin:auto;">
<tr>
<th>1 class</th>
<th>3 classes</th>
</tr>
<tr><td>
<img src="outputs/eval_point.png" width="400"></td><td>
<img src="outputs/eval_point_full.png" width="400"></td>
</tr>
</table>
''')
Quantitatively, the model trained on 1 class performs better on that class than the model trained on 3 classes. I conclude that the model might be too small to absorb the data from all classes, leading to sub-optimal performance.