andrew ID: guyingl
Left: Ground truth; Right: Fitting result
Left: Ground truth; Right: Fitting result
Left: Ground truth; Right: Fitting result
The hyperparameters for training the following three models are roughly the same, with an initial learning rate of 4e-4. The model for Q2.1 was trained for 20k iterations, while the models for Q2.2 and Q2.3 were trained for 15k iterations. For the implementation of each decoder, please refer to model.py. No additional command-line arguments were used during training.






GT image | GT | Predicted






GT image | GT | Predicted






GT image | GT | Predicted


Mesh (F1@0.05: 78.09) | Vox (F1@0.05: 41.57) | Point (F1@0.05: 86.48)
Voxel prediction tends to achieve lower scores compared to point cloud or mesh prediction. This may be because voxel outputs require the model to implicitly learn structural connectivity, even though the training process does not enforce it. By contrast, mesh prediction benefits from starting with a template mesh that already encodes connectivity, and point cloud prediction avoids this issue altogether since connectivity is irrelevant and the evaluation protocol aligns well with that representation.
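The F1@0.05 scores above can be sketched as follows. This is a minimal NumPy version assuming the standard point-cloud F1 definition (precision and recall over nearest-neighbor distances at a 0.05 threshold); it is illustrative, not the assignment's exact evaluation code.

```python
import numpy as np

def f1_score(pred, gt, thresh=0.05):
    # pairwise Euclidean distances between predicted (N,3) and GT (M,3) points
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    precision = (d.min(axis=1) < thresh).mean()  # fraction of pred points near GT
    recall = (d.min(axis=0) < thresh).mean()     # fraction of GT points near pred
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because the metric only compares point sets, representations are scored after sampling points from their surface, which is why the evaluation aligns naturally with point-cloud prediction.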
I experimented with two values of w_smooth, 0.1 and 100000, training a model with each weight for the same number of iterations (12k). The visualizations for the three cases are listed as follows:






GT | W=10000 | W=0.1
Although increasing w_smooth makes the mesh smoother, it
also causes the result to degenerate into a sphere, making fine details
difficult to learn. Therefore, during training, it is important to
balance the pursuit of smoothness and the preservation of geometric
details.
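The balance described above can be sketched as a weighted sum of a reconstruction term and a smoothness term. This is a minimal sketch assuming a Chamfer reconstruction loss plus a uniform-Laplacian smoothness loss; the function names are illustrative, not the assignment's exact losses.

```python
import numpy as np

def chamfer(p, q):
    # symmetric Chamfer distance between point sets p (N,3) and q (M,3)
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def laplacian_smoothness(verts, neighbors):
    # mean displacement of each vertex from the centroid of its neighbors;
    # larger values indicate a rougher surface
    return np.mean([np.linalg.norm(verts[i] - verts[nbrs].mean(axis=0))
                    for i, nbrs in enumerate(neighbors)])

def mesh_loss(pred_pts, gt_pts, verts, neighbors, w_smooth):
    # w_smooth controls the smoothness/detail trade-off: too large and the
    # smoothness term dominates, collapsing the mesh toward a sphere
    return chamfer(pred_pts, gt_pts) + w_smooth * laplacian_smoothness(verts, neighbors)
```

With a very large w_smooth, gradients from the smoothness term overwhelm those from the reconstruction term, which is consistent with the degenerate spherical results observed above.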
I experimented with two input images to obtain different feature codes, and then used a blending parameter λ to interpolate between them. From left to right and top to bottom, I show the predicted point clouds corresponding to different values of λ (λ = 0, 0.2, 0.4, … 0.8, 1). As λ increases, the predicted point cloud gradually transitions from the shape corresponding to the left image to that of the right image.
This smooth transformation indicates that the learned feature space is continuous and structured: interpolating between feature embeddings produces meaningful intermediate 3D reconstructions, rather than random noise. In other words, the network has learned a latent representation where semantic and geometric information varies smoothly, demonstrating that the model captures a consistent manifold of 3D shapes.
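The interpolation procedure can be sketched as below. This is a minimal sketch: each blended code would be passed through the trained point-cloud decoder (in model.py) to produce the intermediate reconstructions shown in the figure.

```python
import numpy as np

def interpolate_codes(code_a, code_b, lambdas=(0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    # linearly blend the two image feature codes; decoding each blended code
    # yields the sequence of intermediate point clouds
    return [(1 - lam) * code_a + lam * code_b for lam in lambdas]
```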

Input image 1 | Input image 2





From left to right and top to bottom, λ = 0, 0.2, 0.4, … 0.8, 1
My implementation is based on AtlasNet. For the implementation details, please refer to model.py.
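The AtlasNet-style decoding idea can be sketched as follows: 2D points sampled on a unit-square patch are concatenated with the image feature code and mapped by a shared MLP to 3D surface points. The weights here are random and purely illustrative; the trained decoder lives in model.py.

```python
import numpy as np

def atlasnet_decode(code, n_points=100, hidden=64, seed=0):
    # AtlasNet idea: sample 2D points on a patch, concatenate each with the
    # latent code, and map through a shared MLP to 3D coordinates
    rng = np.random.default_rng(seed)
    uv = rng.uniform(size=(n_points, 2))                  # patch samples in [0,1]^2
    x = np.concatenate([uv, np.tile(code, (n_points, 1))], axis=1)
    W1 = rng.standard_normal((x.shape[1], hidden)) * 0.1  # untrained weights
    W2 = rng.standard_normal((hidden, 3)) * 0.1
    return np.tanh(x @ W1) @ W2                           # (n_points, 3) surface points
```

Using multiple patches (each with its own MLP) lets the full AtlasNet cover complex surfaces; a single patch suffices to illustrate the mapping.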






GT image | GT | Predicted