Holding the threshold constant, the F1 scores for mesh and point cloud reconstruction are much higher than for voxel grid reconstruction. This makes sense: the BCE loss used for voxel grids scores each predicted voxel independently and carries no notion of connectivity across the reconstructed shape, whereas point cloud and mesh reconstructions are trained with Chamfer loss, which compares each predicted point against its nearest ground-truth neighbor (and vice versa) and so directly penalizes geometric distance to the target shape.
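For reference, the F1 score at a fixed distance threshold can be sketched as follows. This is a minimal pure-Python illustration, not the evaluation code behind the plots: it assumes the common point-based definition (precision is the share of predicted points within the threshold of some ground-truth point, recall is the converse), and the function names, the brute-force nearest-neighbor search, and the toy points are all made up for the example.

```python
import math

def nearest_distance(point, others):
    """Distance from `point` to its closest neighbor in `others` (brute force)."""
    return min(math.dist(point, other) for other in others)

def f1_at_threshold(pred_points, gt_points, threshold):
    """Point-based F1 score (in percent) at a distance threshold.

    Assumed definition: precision is the share of predicted points lying
    within `threshold` of a ground-truth point; recall is the converse.
    """
    precision = 100.0 * sum(
        nearest_distance(p, gt_points) < threshold for p in pred_points
    ) / len(pred_points)
    recall = 100.0 * sum(
        nearest_distance(g, pred_points) < threshold for g in gt_points
    ) / len(gt_points)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: two of three predicted points land near the ground truth,
# and two of three ground-truth points are covered by a prediction.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
pred = [(0.01, 0.0, 0.0), (1.0, 0.02, 0.0), (0.5, 0.5, 0.5)]
score = f1_at_threshold(pred, gt, threshold=0.05)
```

Because both precision and recall are computed from nearest-neighbor distances, a representation trained with a distance-based loss such as Chamfer is optimized for something close to what this metric measures.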
Voxel grid:

Point cloud:

Mesh:

I chose to vary the Chamfer loss weight when training the 3D mesh reconstruction. The default value is 1.0 (used in section 2.3 above), and I also tried decreasing and increasing the weight to 0.1 and 10, respectively. The results are below (see the rightmost column for the predicted mesh). The Chamfer loss weight has a noticeable qualitative effect on the reconstructed mesh. At a lower weight, the loss is dominated by the other term (the smoothness loss), so the resulting mesh is much smoother. At a higher weight, the Chamfer loss dominates, so the reconstruction is "spikier" and less smooth. The default values of w_chamfer = 1.0 and w_smooth = 0.1 seem to strike a good balance between faithfulness of the reconstruction and smoothness of the mesh.
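The trade-off described above can be illustrated with a small sketch of the weighted objective. It assumes the total loss is the plain weighted sum `w_chamfer * loss_chamfer + w_smooth * loss_smooth`; the function names and the raw loss values are hypothetical, chosen only for illustration and not taken from the training runs.

```python
def total_loss(loss_chamfer, loss_smooth, w_chamfer=1.0, w_smooth=0.1):
    """Weighted sum of the two mesh losses (assumed form of the objective)."""
    return w_chamfer * loss_chamfer + w_smooth * loss_smooth

def chamfer_share(loss_chamfer, loss_smooth, w_chamfer, w_smooth=0.1):
    """Fraction of the total loss contributed by the Chamfer term."""
    weighted_chamfer = w_chamfer * loss_chamfer
    return weighted_chamfer / total_loss(
        loss_chamfer, loss_smooth, w_chamfer, w_smooth
    )

# Hypothetical raw loss values, chosen only to show how the balance shifts.
raw_chamfer, raw_smooth = 0.02, 0.05

shares = {w: chamfer_share(raw_chamfer, raw_smooth, w) for w in (0.1, 1.0, 10.0)}
for w, share in shares.items():
    print(f"w_chamfer = {w:>4}: Chamfer term is {share:.0%} of the total loss")
```

With these made-up numbers the Chamfer share of the objective grows as `w_chamfer` increases, which mirrors the qualitative trend: the optimizer has more incentive to match the target points and less incentive to keep the surface smooth.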
Weight = 0.1

Weight = 1

Weight = 10

I created a "slicing" animation that animates orthogonal slices through the predicted voxel volume and shows prediction probabilities via a heatmap. This makes it easier to see whether the structure has holes or is hollow where it shouldn't be, whether the boundaries of the shape are "fuzzy" (higher uncertainty), and where the shape is thinner or thicker overall. Below are some example gifs:



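A visualization of this kind could be put together roughly as follows. This is a sketch under assumptions, not the code used for the gifs above: it assumes the prediction is a NumPy array of per-voxel occupancy probabilities with shape `(D, H, W)`, and the helper names `orthogonal_slices` and `fuzzy_fraction`, the synthetic ball volume, and the 0.2 to 0.8 "uncertain" band are hypothetical choices.

```python
import numpy as np

def orthogonal_slices(volume, axis):
    """Return the 2D slices of a 3D volume taken perpendicular to `axis`.

    The result has shape (n_slices, rows, cols), so result[i] is frame i
    of the animation along that axis.
    """
    return np.moveaxis(volume, axis, 0)

def fuzzy_fraction(frame, low=0.2, high=0.8):
    """Share of voxels in a slice whose probability is neither clearly
    empty nor clearly occupied (the band limits are arbitrary)."""
    uncertain = (frame > low) & (frame < high)
    return float(uncertain.mean())

# Stand-in for a predicted volume: a solid ball with a soft boundary.
grid = np.linspace(-1.0, 1.0, 32)
x, y, z = np.meshgrid(grid, grid, grid, indexing="ij")
radius = np.sqrt(x**2 + y**2 + z**2)
probs = 1.0 / (1.0 + np.exp((radius - 0.6) * 20.0))  # sigmoid falloff

# One stack of frames per animation; here, slices perpendicular to the z axis.
frames = orthogonal_slices(probs, axis=2)
fuzziness = [fuzzy_fraction(frame) for frame in frames]
```

Each `frames[i]` can then be drawn as a heatmap with a fixed color range of 0 to 1 (so colors are comparable across frames) and the frames written out as a gif. A per-slice summary such as `fuzziness` gives a rough numeric counterpart to the "fuzzy boundary" impression from the animation.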
I chose to train and evaluate the point cloud reconstruction model on the extended dataset containing three classes.
We first compare the qualitative results; below are the same sample results copied from above, using the model trained on a single class:

In comparison, these are the results on the same testing samples, using the multi-class trained model:

The results still largely resemble the ground truth shapes but are lower quality overall: shape outlines are less precise and somewhat more amorphous, and the output samples show less diversity. Notice that there is less definition in the thinner parts of the reconstructed chairs, such as the legs.
For the quantitative comparison, here are the F1 score plots:
Training on one class:

Training on three classes:

In general, the multi-class model has a lower F1 score than the single-class model at the same threshold, indicating degraded 3D consistency of the output samples compared to the single-class model.