andrew ID: guyingl
Left: Ground truth; Right: Fitting result
Left: Ground truth; Right: Fitting result
Left: Ground truth; Right: Fitting result
The hyperparameters for training the following three models are roughly the same, with an initial learning rate of 4e-4. The model for Q2.1 was trained for 20k iterations, while the models for Q2.2 and Q2.3 were trained for 15k iterations. For the implementation of each decoder, please refer to model.py. No additional command-line arguments were used during training.






GT image | GT | Predicted






GT image | GT | Predicted






GT image | GT | Predicted


Mesh (F1@0.05: 78.09) | Vox (F1@0.05: 41.57) | Point (F1@0.05: 86.48)
Voxel prediction tends to achieve lower scores compared to point cloud or mesh prediction. This may be because voxel outputs require the model to implicitly learn structural connectivity, even though the training process does not enforce it. By contrast, mesh prediction benefits from starting with a template mesh that already encodes connectivity, and point cloud prediction avoids this issue altogether since connectivity is irrelevant and the evaluation protocol aligns well with that representation.
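The F1@0.05 scores above can be sketched as follows. This is a minimal NumPy version assuming the standard point-cloud F1 definition (precision and recall over nearest-neighbor distances at a 0.05 threshold); it is illustrative, not the assignment's exact evaluation code.

```python
import numpy as np

def f1_score(pred, gt, thresh=0.05):
    # pairwise Euclidean distances between predicted (N,3) and GT (M,3) points
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    precision = (d.min(axis=1) < thresh).mean()  # fraction of pred points near GT
    recall = (d.min(axis=0) < thresh).mean()     # fraction of GT points near pred
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because the metric only compares point sets, representations are scored after sampling points from their surface, which is why the evaluation aligns naturally with point-cloud prediction.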
I experimented with two values of w_smooth, 0.1 and 100000, training a model with each weight for the same number of iterations (12k). The visualizations for the three cases are listed as follows:






GT | W=10000 | W=0.1
Although increasing w_smooth makes the mesh smoother, it
also causes the result to degenerate into a sphere, making fine details
difficult to learn. Therefore, during training, it is important to
balance the pursuit of smoothness and the preservation of geometric
details.
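The balance described above can be sketched as a weighted sum of a reconstruction term and a smoothness term. This is a minimal sketch assuming a Chamfer reconstruction loss plus a uniform-Laplacian smoothness loss; the function names are illustrative, not the assignment's exact losses.

```python
import numpy as np

def chamfer(p, q):
    # symmetric Chamfer distance between point sets p (N,3) and q (M,3)
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def laplacian_smoothness(verts, neighbors):
    # mean displacement of each vertex from the centroid of its neighbors;
    # larger values indicate a rougher surface
    return np.mean([np.linalg.norm(verts[i] - verts[nbrs].mean(axis=0))
                    for i, nbrs in enumerate(neighbors)])

def mesh_loss(pred_pts, gt_pts, verts, neighbors, w_smooth):
    # w_smooth controls the smoothness/detail trade-off: too large and the
    # smoothness term dominates, collapsing the mesh toward a sphere
    return chamfer(pred_pts, gt_pts) + w_smooth * laplacian_smoothness(verts, neighbors)
```

With a very large w_smooth, gradients from the smoothness term overwhelm those from the reconstruction term, which is consistent with the degenerate spherical results observed above.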
I experimented with two input images to obtain different feature codes, and then used a blending parameter λ to interpolate between them. From left to right and top to bottom, I show the predicted point clouds corresponding to different values of λ (λ = 0, 0.2, 0.4, … 0.8, 1). As λ increases, the predicted point cloud gradually transitions from the shape corresponding to the left image to that of the right image.
This smooth transformation indicates that the learned feature space is continuous and structured: interpolating between feature embeddings produces meaningful intermediate 3D reconstructions, rather than random noise. In other words, the network has learned a latent representation where semantic and geometric information varies smoothly, demonstrating that the model captures a consistent manifold of 3D shapes.
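The interpolation procedure can be sketched as below. This is a minimal sketch: each blended code would be passed through the trained point-cloud decoder (in model.py) to produce the intermediate reconstructions shown in the figure.

```python
import numpy as np

def interpolate_codes(code_a, code_b, lambdas=(0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    # linearly blend the two image feature codes; decoding each blended code
    # yields the sequence of intermediate point clouds
    return [(1 - lam) * code_a + lam * code_b for lam in lambdas]
```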

Input image 1 | Input image 2





From left to right and top to bottom, λ = 0, 0.2, 0.4, … 0.8, 1
My implementation is based on AtlasNet. For the implementation details, please refer to model.py.
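The AtlasNet-style decoding idea can be sketched as follows: 2D points sampled on a unit-square patch are concatenated with the image feature code and mapped by a shared MLP to 3D surface points. The weights here are random and purely illustrative; the trained decoder lives in model.py.

```python
import numpy as np

def atlasnet_decode(code, n_points=100, hidden=64, seed=0):
    # AtlasNet idea: sample 2D points on a patch, concatenate each with the
    # latent code, and map through a shared MLP to 3D coordinates
    rng = np.random.default_rng(seed)
    uv = rng.uniform(size=(n_points, 2))                  # patch samples in [0,1]^2
    x = np.concatenate([uv, np.tile(code, (n_points, 1))], axis=1)
    W1 = rng.standard_normal((x.shape[1], hidden)) * 0.1  # untrained weights
    W2 = rng.standard_normal((hidden, 3)) * 0.1
    return np.tanh(x @ W1) @ W2                           # (n_points, 3) surface points
```

Using multiple patches (each with its own MLP) lets the full AtlasNet cover complex surfaces; a single patch suffices to illustrate the mapping.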






GT image | GT | Predicted