16-726 Learning-Based Image Synthesis

Final Project: Disentangling Latent Space via VAE
Jun Luo
May 2021

Overview

Variational Auto-encoders (VAEs) and Generative Adversarial Networks (GANs) are two major directions for structured data synthesis. In a VAE, the encoder learns a mapping from the data to the latent space, \( q_{\phi}(z|x) \), while the decoder learns a mapping from the latent space back to the data, \( p_{\theta}(x|z) \). In a GAN, a generator \( G \) and a discriminator \( D \) play a minimax game: the discriminator tries to differentiate real images \( x \) from synthesized images \( G(z) \) produced by the generator, while the generator tries to generate realistic images that fool the discriminator. In both methods, image labels can serve as additional information for the models to learn from. A VAE can then learn the conditional mappings \( q_{\phi}(z|x, c) \) and \( p_{\theta}(x|z,c) \), and a GAN can condition both its generator, \( G(z, c) \), and its discriminator, \( D(x, c) \), on the label.

In this project, we focus on disentangling the latent space via a VAE, as a modification of an existing method [1]. Specifically, we train an encoder to extract latent vectors from the image data, and we push the latent vectors to be independent of the labels. We then attach labels to the latent vector to generate images from different classes, with the goal that images generated from the same latent vector but with different class labels share a similar style.


Method

As shown in the figure below, we feed the image into an encoder to obtain the latent vector \( z \), whose distribution we constrain to be a standard normal distribution. The model then splits into two branches. The upper branch in the figure is responsible for generating the image: the one-hot embedded label of the image is concatenated with the latent vector, and the pair is fed into the decoder to produce the reconstructed image. We compute the reconstruction loss as a binary cross-entropy loss, and since we push the distribution of the latent vector \( z \) toward a standard normal distribution, we also introduce a KL divergence loss.

[Figure: method overview, starting from the original image]
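To make the upper branch concrete, below is a minimal PyTorch sketch of the encoder, the reparameterization step, and the label-conditioned decoder. The layer widths, the 16-dimensional latent space, and the MNIST-sized flattened input are illustrative assumptions, not the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM, NUM_CLASSES, IMG_DIM = 16, 10, 784  # assumed sizes (MNIST-like)

class Encoder(nn.Module):
    """Maps an image x to the parameters (mu, logvar) of q_phi(z|x)."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(IMG_DIM, 400), nn.ReLU())
        self.fc_mu = nn.Linear(400, LATENT_DIM)
        self.fc_logvar = nn.Linear(400, LATENT_DIM)

    def forward(self, x):
        h = self.body(x)
        return self.fc_mu(h), self.fc_logvar(h)

class Decoder(nn.Module):
    """Maps [z, one-hot label] back to a reconstructed image."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(LATENT_DIM + NUM_CLASSES, 400), nn.ReLU(),
            nn.Linear(400, IMG_DIM), nn.Sigmoid())  # pixel values in [0, 1]

    def forward(self, z, y):
        c = F.one_hot(y, NUM_CLASSES).float()       # one-hot embed the label
        return self.body(torch.cat([z, c], dim=1))

def reparameterize(mu, logvar):
    """Sample z ~ q_phi(z|x) via the reparameterization trick."""
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)
```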

In the lower branch, we take the latent vector \( z \) as the input to an adversarial classifier, which is a multilayer perceptron. The goal of the adversarial classifier is to predict the category of the image from its latent vector, so a cross-entropy loss is enforced on it. Since our goal for the encoder is to generate a latent vector that contains no information about the image's category, we assign the encoder a loss that pushes it to fool the adversarial classifier.
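Continuing the sketch above (same imports and dimensions), the adversarial classifier can be a small MLP over \( z \); the hidden width here is an assumption:

```python
class AdversarialClassifier(nn.Module):
    """MLP q_omega(c|z) that tries to recover the label from z alone."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM, 64), nn.ReLU(),
            nn.Linear(64, NUM_CLASSES))  # class logits

    def forward(self, z):
        return self.net(z)
```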

More specifically, there are four losses in our network. The reconstruction loss $$ L_{recon} = \mathrm{BCE}(x, \hat{x}). $$ The KL divergence loss $$ L_{KL} = D_{KL}\big(q_{\phi}(z|x) \,\|\, N(0, I)\big). $$ The adversarial loss for the classifier $$ L_C^{adv} = -\mathbb{E}_{q_{\phi}(z|x)} \sum_c \mathbb{I}(c=y) \log q_{\omega}(c|z). $$ And the adversarial loss for the encoder, which pushes it to strip the label information from the latent vector \( z \), $$ L_{E}^{adv} = -\mathbb{E}_{q_{\phi}(z|x)} \sum_{c} \frac{1}{C} \log q_{\omega}(c|z), $$ where the adversarial classifier is parameterized by \( \omega \).
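These four losses map to code directly; the sketch below (continuing the PyTorch example above) is one way to compute them. In training, \( L_C^{adv} \) would update only the classifier parameters \( \omega \), while \( L_E^{adv} \), together with the reconstruction and KL terms, would update the encoder and decoder; the loss weights named in the comment are hypothetical.

```python
def vae_losses(x, x_hat, mu, logvar):
    # L_recon: binary cross entropy between the input and its reconstruction
    recon = F.binary_cross_entropy(x_hat, x, reduction='sum')
    # L_KL: closed-form KL(q_phi(z|x) || N(0, I)) for a diagonal Gaussian
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon, kl

def classifier_loss(logits, y):
    # L_C^adv: standard cross entropy so the classifier predicts the label
    return F.cross_entropy(logits, y)

def encoder_adv_loss(logits):
    # L_E^adv: push q_omega(c|z) toward uniform by averaging the
    # log-probabilities over all C classes (the encoder fools the classifier)
    log_probs = F.log_softmax(logits, dim=1)
    return -log_probs.mean(dim=1).mean()

# Hypothetical weighted objective for the encoder/decoder update:
#   total = recon + lam_kl * kl + lam_adv * encoder_adv_loss(logits)
```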


Experiments and Results

To evaluate our method, we compare it with the vanilla VAE [2] and the conditional VAE [3]. We conduct experiments on several datasets, including MNIST [4], the cat dataset [5] from homework 3, and CIFAR10 [6]. Unfortunately, the method does not produce reasonable results on all of them. Below, we show the reconstruction results and sample results. For the reconstruction results, the odd columns are the original images.

Reconstruction

VAE
Conditional VAE
Ours

Sample

For the sample results: for the VAE, we sample 16 different latent codes from the standard normal distribution. For the conditional VAE and our method, we sample 5 different latent codes (corresponding to the 5 rows), and for each code we show the results of assigning different labels to it (corresponding to the columns). Below are the sample results on the MNIST dataset.
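For reference, the sampling grid for the conditional models can be produced as in the sketch below (reusing the Decoder defined in the Method section): each row shares one latent code drawn from \( N(0, I) \), and each column fixes a different label.

```python
@torch.no_grad()
def sample_grid(decoder, num_rows=5, num_classes=NUM_CLASSES):
    """Decode each of num_rows latent codes under every class label."""
    z = torch.randn(num_rows, LATENT_DIM)            # one latent code per row
    cols = []
    for c in range(num_classes):                     # one label per column
        y = torch.full((num_rows,), c, dtype=torch.long)
        cols.append(decoder(z, y))                   # [num_rows, IMG_DIM]
    return torch.stack(cols, dim=1)                  # [num_rows, num_classes, IMG_DIM]
```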

VAE
Conditional VAE
Ours

We can see from these results that the conditional VAE does capture style: within each row (corresponding to one latent vector) the style is similar, while it differs across rows. Our method does not capture this, which suggests that our decoder weighs the label too heavily during synthesis.

Below are the reconstruction results and sample results on the cat dataset. Again, for the reconstruction results, the odd columns are the original images.

Reconstruction

Conditional VAE
Ours

Sample

Conditional VAE
Ours

Again, we can see that different latent codes produce visible differences only for the conditional VAE, but on this dataset the overall synthesis from our method looks more realistic. This also suggests that the decoder takes the label as the major factor when synthesizing the image, which helps it learn the characteristics of each class better.

The failure cases appear on the CIFAR10 dataset and the Pokemon dataset [7]. Below, we show the sample results for the CIFAR10 dataset and the reconstruction results for the Pokemon dataset.

Failure on samples of CIFAR10
Failure on reconstruction of the Pokemon dataset

Conclusion

In this project, we developed a VAE model that tries to disentangle the latent space by making the latent vector label-irrelevant. We use an adversarial classifier to force the encoder to generate latent vectors that contain little information relevant to classification. However, this hurts the performance of the decoder to some extent, since the decoder relies much more heavily on the label to synthesize the images, so much so that the differences among latent vectors may even be ignored. To resolve this issue, further hyperparameter tuning is needed, including tuning the four coefficients of the four losses in this method.


References

[1] Zheng, Zhilin, and Li Sun. "Disentangling latent space for vae by label relevant/irrelevant dimensions." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.

[2] Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." arXiv preprint arXiv:1312.6114 (2013).

[3] Sohn, Kihyuk, Honglak Lee, and Xinchen Yan. "Learning structured output representation using deep conditional generative models." Advances in neural information processing systems 28 (2015): 3483-3491.

[4] MNIST dataset http://yann.lecun.com/exdb/mnist/

[5] Cat dataset (grumpy cat & Russian Blue) from HW3 https://learning-image-synthesis.github.io/assignments/hw3

[6] CIFAR10 dataset https://www.cs.toronto.edu/~kriz/cifar.html

[7] Pokemon dataset from HW3 https://learning-image-synthesis.github.io/assignments/hw3