Q1

Q1.1

| Ground Truth | Fitted Result |
| --- | --- |
| alt text | alt text |

Q1.2

| Ground Truth | Fitted Result |
| --- | --- |
| alt text | alt text |

Q1.3

| Ground Truth | Fitted Result |
| --- | --- |
| alt text | alt text |

Q2

Q2.1

| Ground Truth Image | Ground Truth | Prediction |
| --- | --- | --- |
| alt text | alt text | alt text |
| alt text | alt text | alt text |
| alt text | alt text | alt text |

Q2.2

| Ground Truth Image | Ground Truth | Prediction |
| --- | --- | --- |
| alt text | alt text | alt text |
| alt text | alt text | alt text |
| alt text | alt text | alt text |

Q2.3

| Ground Truth Image | Ground Truth | Prediction |
| --- | --- | --- |
| alt text | alt text | alt text |
| alt text | alt text | alt text |
| alt text | alt text | alt text |

Q2.4

| Voxel (Avg F1@0.05: 74.961) | Point Cloud (Avg F1@0.05: 79.393) | Mesh (Avg F1@0.05: 73.999) |
| --- | --- | --- |
| alt text | alt text | alt text |

For all three representation networks, the F1 score naturally increases with the distance threshold, since a larger threshold relaxes the matching constraint. Among the representations, the point cloud model performs best with an average F1 score of around 80, followed by voxels at 75 and meshes at 74.

The point cloud model scores highest because it directly minimizes the distance between predicted and ground-truth point locations, which discourages points from straying far from the ground-truth surface (giving higher precision) and lets the model capture much finer details. For voxels, we have discretized the 3D structure, losing finer details (which can hurt recall) and introducing new artifacts (which can hurt precision). Notably, the voxel network had to be trained for 10x as many iterations as the point cloud and mesh networks to reach this score, which suggests that the voxel model can converge stably given enough iterations.

Meshes are constrained by the connectivity of the initial surface they deform. This is a harder optimization problem than simply moving free points around, as with point clouds, or predicting per-cell occupancy, as with voxels. If the initial surface is not detailed enough, the deformation can also introduce planar artifacts, hurting both precision and recall. On the other hand, voxels optimize over a dense grid rather than only surface points as meshes do; a naive voxel network spends most of its capacity on empty space that carries little signal for the network to learn from. Hence meshes reach comparable performance with far fewer iterations.
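For reference, here is a minimal sketch of how an F1@d metric of this kind can be computed between sampled point sets; the function and variable names are mine, not the assignment's evaluation code.

```python
import torch

def f1_at_threshold(pred_pts, gt_pts, threshold=0.05):
    """pred_pts: (N, 3), gt_pts: (M, 3), points sampled from the two surfaces."""
    dists = torch.cdist(pred_pts, gt_pts)  # (N, M) pairwise distances
    # Precision: fraction of predicted points within `threshold` of the GT.
    precision = (dists.min(dim=1).values < threshold).float().mean()
    # Recall: fraction of GT points within `threshold` of some prediction.
    recall = (dists.min(dim=0).values < threshold).float().mean()
    # Harmonic mean, reported as a percentage.
    return 100.0 * 2 * precision * recall / (precision + recall + 1e-8)
```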

Q2.5

The networks predicting 500, 1500, and 2000 points were each trained for the same number of iterations, at two learning rates: 4e-3 and 4e-4.

Ground Truth Image: alt text

| | LR = 0.004 | LR = 0.0004 |
| --- | --- | --- |
| 500 Points | alt text | alt text |
| 1500 Points | alt text | alt text |
| 2000 Points | alt text | alt text |

| | LR = 0.004 | LR = 0.0004 |
| --- | --- | --- |
| 500 Points | Avg F1@0.05: 67.965 | Avg F1@0.05: 71.255 |
| 1500 Points | Avg F1@0.05: 74.725 | Avg F1@0.05: 81.897 |
| 2000 Points | Avg F1@0.05: 72.118 | Avg F1@0.05: 83.565 |

Quantitatively, we see that F1 scores were higher when training with a learning rate of 4e-4 than with 4e-3. Qualitatively, the output of the network trained with the higher learning rate, while still recognizably a chair, is less accurate with respect to the ground truth. This indicates that training is much more stable at the lower learning rate; a higher learning rate can overshoot minima or get stuck in one that does not generalize well across all types of chairs.

The highest F1 score, around 83.5, was achieved by the network predicting 2000 points, and the score declines as the number of points decreases. The increase in F1 with more points suggests that the model has more signal to learn from and is not forced to explain the underlying distribution of dense surfaces with sparse points. With fewer points, sparsity forces the model to resolve ambiguity that may be hard to resolve, and it can also lead to underfitting since the points are not enough to explain the underlying data distribution.

Qualitatively, these effects are visible in the outputs. With 500 points, the chair is missing some parts, for example the extension of the seat down to the front legs. With 1500 points, the model introduces artifacts such as armrests that do not appear in the ground truth. With 2000 points, the model captures every structure without adding nonexistent ones such as armrests, although the legs come out thicker than in the ground truth. We don't see this in the 500-point output, since removing the legs entirely would incur too much loss and there aren't enough points to capture all the detail. With 2000 points, there may be slight ambiguity in the positioning of the legs; to minimize the loss, the network places points in several plausible locations, resulting in thicker legs.
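For context, here is a sketch of the kind of decoder head swept in this experiment, where `n_points` is the only knob changed between the 500/1500/2000 runs; the layer sizes are illustrative, not the exact architecture trained above.

```python
import torch.nn as nn

class PointDecoder(nn.Module):
    """Maps an image encoding to a fixed-size point set."""
    def __init__(self, latent_dim=512, n_points=2000):
        super().__init__()
        self.n_points = n_points
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 2048), nn.ReLU(),
            nn.Linear(2048, n_points * 3), nn.Tanh(),  # coordinates in (-1, 1)
        )

    def forward(self, z):                    # z: (B, latent_dim)
        return self.mlp(z).view(-1, self.n_points, 3)
```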

Q2.6

I used two methods to further evaluate the voxel model.

Error Point Clouds

When analyzing F1 scores and comparing predictions against the ground truth, I found it hard to understand gaps in the finer details of the structure. I therefore render an error point cloud, color-coded as follows:

  1. Grey points: Points that match the ground truth (True Positives)
  2. Red points: Points that don't exist in the ground truth but exist in the prediction (False Positives)
  3. Blue points: Points that exist in the ground truth but are missing in the prediction (False Negatives)

These masks are computed at different distance thresholds, similar to how the F1 scores are calculated (see the sketch after the images below). They help reveal structural discrepancies where plain visualization and F1 scores fall short. For example, in the first image the model has trouble deciding where to put the backrest: it places it farther forward than in the ground truth, indicating a depth discrepancy. Similarly, in the second image the model struggles with the correct location of the base, and the chair base comes out much shorter than in the ground truth. In the third image, the red points themselves form a structure resembling a chair, which indicates there aren't enough training samples for this style of chair and the model does not generalize to all chair styles; it tries to fit everything into a standard chair.

alt text alt text alt text
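A minimal sketch of how the coloring above can be computed, assuming `pred_pts` and `gt_pts` are point sets sampled from the predicted and ground-truth surfaces; the helper name is mine.

```python
import torch

def error_point_cloud(pred_pts, gt_pts, threshold=0.05):
    """Split points into TP/FP/FN sets by nearest-neighbor distance."""
    dists = torch.cdist(pred_pts, gt_pts)            # (N, M)
    tp_mask = dists.min(dim=1).values < threshold    # matched predictions
    fn_mask = dists.min(dim=0).values >= threshold   # unmatched ground truth
    grey = pred_pts[tp_mask]     # true positives
    red  = pred_pts[~tp_mask]    # false positives
    blue = gt_pts[fn_mask]       # false negatives
    return grey, red, blue
```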

Grad Cam

I also ran Grad-CAM to see which areas of the image are not represented well in the predicted 3D structures. Here we backpropagate the loss between the ground truth and the predicted 3D structure through the encoder to obtain a heatmap over the input image (a sketch follows the table below). The bright yellow regions are usually the parts missing from the 3D structures, indicating that the model most often gets the back of the chairs wrong during reconstruction.

| Image | GT | Pred | Grad-CAM |
| --- | --- | --- | --- |
| alt text | alt text | alt text | alt text |
| alt text | alt text | alt text | alt text |
| alt text | alt text | alt text | alt text |
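A sketch of the Grad-CAM pass described above, assuming a `target_layer` (e.g. the encoder's last conv layer) and the same reconstruction loss used in training; all names here are placeholders, not the actual project code.

```python
import torch
import torch.nn.functional as F

def grad_cam(encoder, decoder, recon_loss, target_layer, image, gt):
    """Heatmap over `image` from gradients of the reconstruction loss."""
    feats = {}
    def hook(module, inp, out):
        out.retain_grad()          # keep gradients on the feature map
        feats["act"] = out
    handle = target_layer.register_forward_hook(hook)

    pred = decoder(encoder(image))   # image: (1, 3, H, W)
    loss = recon_loss(pred, gt)      # same loss as during training
    loss.backward()
    handle.remove()

    act = feats["act"]                                   # (1, C, h, w)
    weights = act.grad.mean(dim=(2, 3), keepdim=True)    # pooled gradients
    cam = F.relu((weights * act).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:],
                        mode="bilinear", align_corners=False)
    return cam / (cam.max() + 1e-8)                      # normalized heatmap
```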

Q3

Q3.3

I trained the mesh network on all 3 classes.

| 3 Classes (Avg F1@0.05: 73.010) | Single Class, Mesh (Avg F1@0.05: 73.999) |
| --- | --- |
| alt text | alt text |

| Ground Truth Image | Ground Truth | Single-Class Prediction | Multi-Class Prediction |
| --- | --- | --- | --- |
| alt text | alt text | alt text | alt text |
| alt text | alt text | alt text | alt text |
| alt text | alt text | alt text | alt text |

From the renderings we can observe that training on 3 classes still captures the global shape of the object, just as single-class training does. This is also reflected by the comparable F1 scores, with a difference of roughly 1 point. The network can therefore still learn useful geometric features even when jointly trained on multiple classes. However, despite the smoothness loss, the multi-class model struggles with finer details: there are many more non-smooth surfaces, marked by spikes, especially on the back of the chair, which further explains the lower F1 score compared to the single-class case. There are also more triangular faces in odd orientations, for example the armrest in the first example that juts out diagonally. This may be because, once other classes are introduced, the deformation function the network learns for moving vertices must generalize across shapes with very different structures (such as planes vs. chairs). This can cause interference, as features useful for one class may deform the geometry of another, e.g. plane wings interfering with chair armrests. One way to improve this could be to condition the model on the class, as in the sketch below.
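A sketch of that conditioning idea, assuming a learned per-class embedding concatenated to the image encoding before the existing deformation decoder; this is illustrative, not something I trained.

```python
import torch
import torch.nn as nn

class ClassConditionedDecoder(nn.Module):
    """Wraps an existing decoder with a learned class embedding."""
    def __init__(self, decoder, latent_dim=512, n_classes=3, emb_dim=64):
        super().__init__()
        self.class_emb = nn.Embedding(n_classes, emb_dim)
        self.proj = nn.Linear(latent_dim + emb_dim, latent_dim)
        self.decoder = decoder  # the existing mesh deformation head

    def forward(self, z, class_id):
        # Append the class embedding, then project back to the decoder's input size.
        z = torch.cat([z, self.class_emb(class_id)], dim=-1)
        return self.decoder(self.proj(z))
```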

Q3.1

I implemented implicit surface prediction by concatenating 3D query points to the image encoding and passing the result through a decoder that predicts the probability of occupancy. The network is trained with BCE-with-logits loss for 100k iterations; the loss reaches 0.04 by the end of training. During training the points are sampled uniformly at random from (-1, 1); during inference we query a 32x32x32 grid of points over (-1, 1). A sketch of the decoder follows.
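The layer sizes below are illustrative rather than the exact ones used; each query point (x, y, z) is concatenated to the image encoding and mapped to an occupancy logit.

```python
import torch
import torch.nn as nn

class OccupancyDecoder(nn.Module):
    """Predicts occupancy logits for 3D query points given an image encoding."""
    def __init__(self, latent_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                # occupancy logit
        )

    def forward(self, z_img, pts):
        # z_img: (B, latent_dim), pts: (B, P, 3) sampled in (-1, 1)
        z = z_img.unsqueeze(1).expand(-1, pts.shape[1], -1)
        return self.mlp(torch.cat([z, pts], dim=-1)).squeeze(-1)

# Trained with BCE-with-logits against ground-truth occupancies:
# loss = nn.BCEWithLogitsLoss()(decoder(z_img, pts), occ_labels)
```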

| Ground Truth Image | Ground Truth | Prediction |
| --- | --- | --- |
| alt text | alt text | alt text |
| alt text | alt text | alt text |
| alt text | alt text | alt text |

We can see the outputs don't look as good, consisting mostly of planar sheets. While the model captures some features such as the vertical backrest, it is still far from the target mesh, and even the F1 score at threshold 0.05 is fairly low, at roughly 22%. This might be due to training with a very low number of points (2k) at the beginning; I only increased this to 16k points at around 70k iterations, which could have left the model stuck in a bad basin that is hard to escape. Moreover, most of the sampled points likely fall in empty space. To tackle this, I would have liked to try specifically sampling points near occupied voxels, as in the sketch below. Finally, concatenating a single global vector z to (x, y, z) may also discard spatial cues and fine details.
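A sketch of the surface-biased sampling I would try, mixing uniform samples with samples jittered around occupied voxel centers so fewer queries land in empty space; the function name and the `sigma` jitter scale are my assumptions.

```python
import torch

def sample_points(voxel_grid, n_uniform=1024, n_surface=1024, sigma=0.02):
    """voxel_grid: (D, D, D) binary occupancy over the (-1, 1)^3 cube.
    Assumes at least one occupied voxel."""
    # Uniform samples over the whole cube.
    uniform = torch.rand(n_uniform, 3) * 2 - 1
    # Samples jittered around occupied voxel centers.
    occ = voxel_grid.nonzero().float()                  # (K, 3) grid indices
    centers = occ / (voxel_grid.shape[0] - 1) * 2 - 1   # indices -> (-1, 1)
    idx = torch.randint(len(centers), (n_surface,))
    near_surface = centers[idx] + sigma * torch.randn(n_surface, 3)
    return torch.cat([uniform, near_surface], dim=0)
```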