Exploring the Latent Space of GANs

Previously, we’ve shown how we can train a Generative Adversarial Network (GAN) to produce fake but realistic-looking images mimicking a target distribution. Now that we have a function to map a standard multi-variate Gaussian to the space of real cat images, we can invert this relation to perform image synthesis operations directly on the latent space.

For this project, we will use our own DC-GAN as well as NVIDIA’s StyleGAN. The architectures of both generators as well as some samples are shown below:

DC-GAN Architecture
DC-GAN Samples
StyleGAN Architecture
StyleGAN Samples

Inverting the Generator

Unfortunately, our generator is not actually an invertible function, so we cannot simply take the function inverse to find the corresponding latent code. Instead, we can treat the inversion as a non-convex optimization problem. For a loss function \mathcal{L} , trained generator G , and real image x , we can approximate the corresponding latent code as

\displaystyle z^* = \underset{z}{\arg\min} \mathcal{L}(G(z), x)

All that is left is to define the loss function and optimizer. The L2 distance between the generated and target images might seem like a good metric at first, but it does not suffice in practice. In the Neural Style Transfer post, we discussed the concept of perceptual loss as a combination of content and style losses. Since we really only care about content here, we will just use the content loss. Following the Image2StyleGAN paper, we will feed the target and generated images through a pre-trained VGG-16 network and use the content losses at layers conv1_1, conv1_2, conv3_2, and conv4_2.
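Roughly, the content loss can be computed as in the sketch below, assuming torchvision's pre-trained VGG-16. The layer indices are my own mapping of conv1_1, conv1_2, conv3_2, and conv4_2 onto vgg16().features and may not match the exact implementation used for this project.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Pre-trained VGG-16 used only as a fixed feature extractor.
vgg = models.vgg16(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

# Assumed mapping of conv1_1, conv1_2, conv3_2, conv4_2 to indices in vgg.features.
CONTENT_LAYERS = {0: "conv1_1", 2: "conv1_2", 12: "conv3_2", 19: "conv4_2"}

def vgg_features(x):
    feats = {}
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in CONTENT_LAYERS:
            feats[CONTENT_LAYERS[i]] = x
    return feats

def perceptual_loss(generated, target):
    f_gen, f_tgt = vgg_features(generated), vgg_features(target)
    return sum(F.mse_loss(f_gen[name], f_tgt[name]) for name in CONTENT_LAYERS.values())
```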

Combined with the pixel-space loss, our optimization objective then becomes

\displaystyle z^* = \underset{z}{\arg\min}\{ \lambda_{perc}\mathcal{L}_{perc}(G(z), x) + MSE(G(z), x)\}

Since we are optimizing relatively few parameters, we can again use the second-order L-BFGS optimizer.
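Putting the pieces together, a single projection might look like the following sketch. The generator G, the latent dimension, and the iteration count are placeholders rather than the exact setup used here.

```python
def project(G, target, latent_dim=128, perc_wgt=1.0, steps=1000):
    """Approximate z* = argmin_z  perc_wgt * L_perc(G(z), x) + MSE(G(z), x)."""
    z = torch.randn(1, latent_dim, requires_grad=True)
    opt = torch.optim.LBFGS([z], max_iter=steps)

    def closure():
        opt.zero_grad()
        img = G(z)
        loss = perc_wgt * perceptual_loss(img, target) + F.mse_loss(img, target)
        loss.backward()
        return loss

    opt.step(closure)  # L-BFGS evaluates the closure repeatedly
    return z.detach()
```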

Here, we will explore the differences between reconstructions in different settings.

First, we will see how the quality of the reconstruction varies as we adjust the weight of the perceptual loss (perc_wgt):

Original Image
DC-GAN
StyleGAN
Original Image
DC-GAN
StyleGAN

The differences are subtle, but weighting the perceptual loss with weight 1 seems to yield the highest-quality results (as also noted in the Image2StyleGAN paper). Too much weight on the perceptual loss allows for too much difference in pixel space, while too little weight makes the optimization less smooth by relying too heavily on the pixel loss.

Furthermore, we also note that the StyleGAN’s reconstruction is much more faithful to the original image. However, StyleGAN also took much longer to generate projections, likely because the optimizer had to back-propagate through a much deeper network.

Knowing that StyleGAN produces more faithful reconstructions, we will use it exclusively moving forward. Since we are only using StyleGAN, we can now also work exclusively with higher-resolution images that our DC-GAN architecture cannot handle.

First, let’s analyze the differences in choosing which latent variables to optimize. Our options are: optimizing the input to the “mapping network” (Z), optimizing the same style vector for each layer of the “synthesis network” (W), or optimizing a different style vector for each layer of the “synthesis network” (W+).

Optimizing the W+ space yields the clearest results. This is likely because optimizing W+ allows the model to retain as much expressivity as possible when reconstructing the target image. Optimizing W+ also took the most time out of the three methods, likely because of the number of parameters being optimized by our second order method.
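For intuition, here is a rough sketch of how the three parameterizations differ, assuming a StyleGAN-style generator split into G.mapping and G.synthesis with G.num_ws style inputs. The attribute names are illustrative, not the exact API used in this project.

```python
def init_latent(G, space="w+"):
    """Pick which latent variables the projection will optimize."""
    z = torch.randn(1, G.z_dim)
    if space == "z":
        return z.requires_grad_(True)          # input to the mapping network
    with torch.no_grad():
        w = G.mapping(z)                       # a single style vector, shape [1, w_dim]
    if space == "w":
        return w.requires_grad_(True)          # one vector shared by every synthesis layer
    # "w+": an independent copy of the style vector for each of the num_ws synthesis layers
    return w.unsqueeze(1).repeat(1, G.num_ws, 1).requires_grad_(True)
```

In the W and W+ cases, the style vectors are fed straight into the synthesis network, so optimizing W+ touches the largest number of free parameters.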

Now that we’ve identified the optimal reconstruction strategy, we can move on to performing arithmetic on the latent space.

Traversing the Latent Space

Consider two images of cats in the 256x256x3 RGB pixel space. Note that all possible cat images (and even natural images in general!) make up a very small subset of all 256^(256×256×3) ≈ 1.75×10^473479 possible images in the space. Thus, if we were to linearly interpolate between two cat images in the pixel space, the intermediate images would not lie in the space of real(ish) cat images:

Start
Naive Interpolation
End

Clearly, the intermediate images from this naïve interpolation do not lie on the manifold of real cat images. However, since the latent space of our generator lies in a convex subspace, we can safely linearly interpolate within the latent space while remaining inside it.

Manifold Interpolation Illustration

To help “illustrate” this point, I had some fun with Google Drawing and made this diagram. Consider the blue curve as our cat manifold and the purple line as our latent space. If we were to just linearly interpolate between the two selected points, the intermediate points would not lie on the cat manifold, so the interpolated image would not look like a cat.

However, if we first project both cat images into our convex latent space, we can then safely linearly interpolate without escaping the latent space. To get back to the cat manifold, we simply pass our interpolated latent vector into our trained generator, and we now have an interpolated image that is still on the cat manifold!

To do this, we can use the method described in the previous section to project our target images onto the latent space and then linearly interpolate between the latent vectors. This way, our latent vectors remain in the convex latent space, and the generated cat images remain within the target manifold:

Start
Latent Interpolation
End

Note that while the start and end images here were constructed via our projection method, the intermediate images were all just created by feeding the interpolated latent vector through our GAN; no optimization needed!
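Concretely, the interpolation itself is just a handful of lines; this is a minimal sketch assuming two latent codes z_start and z_end obtained with the projection method above.

```python
def interpolate(G, z_start, z_end, n_frames=8):
    """Linearly interpolate in the latent space; each frame is a single forward pass."""
    frames = []
    with torch.no_grad():
        for t in torch.linspace(0.0, 1.0, n_frames):
            z_t = (1 - t) * z_start + t * z_end   # stays inside the convex latent space
            frames.append(G(z_t))                 # maps back onto the cat manifold
    return frames
```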

Included below are more interpolations generated via projection to the latent space:

Start
Latent Interpolation
End
Start
Latent Interpolation
End
Start
Latent Interpolation
End
Start
Latent Interpolation
End
Start
Latent Interpolation
End

There are certainly some distortions during the interpolation. However, the intermediate images remain mostly on the manifold of real(ish) cat images.

sketch2cat

Another application of our projection method is generating the cat image that looks most like a target doodle. To do this, we provide a sparse doodle as our target RGB image and optimize the error between the generated image and the target doodle. However, since most of the space in the sketches is blank, we only optimize the error at the pixels where the sketch S actually has brush strokes. We accomplish this by using a binary mask M:

\displaystyle z^* = \underset{z}{\arg\min}\| M\circ G(z) - M\circ S\|_2^2
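A sketch of what this masked loss might look like; the mask construction shown in the comment is just one simple assumption, not necessarily how the provided masks were made.

```python
def masked_sketch_loss(img, sketch, mask):
    """|| M ∘ G(z) - M ∘ S ||_2^2, i.e. squared error only where the sketch has strokes."""
    return ((mask * img - mask * sketch) ** 2).sum()

# One simple (assumed) way to build M: mark every pixel the sketch actually touches.
# mask = (sketch.abs().sum(dim=1, keepdim=True) > 0).float()
```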

Performing this optimization yields the following results:

Sketch
Output

So what happened here? It seems like in an effort to match the sketch, the optimizer adjusted the latent variables too much such that they no longer lie in the subspace used when the generator was originally being trained. Thus, the generator no longer maps the latent variables to the manifold of real cat images.

To stop the optimizer from adjusting the latent variables too much, we can add a regularizer. Specifically, we can penalize the norm of the latent vector such that it can’t become too large. This is typically known as an L2 regularizer or weight decay.

Our updated optimization is then:

\displaystyle z^* = \underset{z}{\arg\min}\{\| M\circ G(z) - M\circ S\|_2^2 +\lambda_{reg}\|z\|_2^2 \}
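In code, the regularizer is just one extra term (again a sketch; lam_reg is the λ_reg above):

```python
def regularized_sketch_loss(img, sketch, mask, z, lam_reg=0.1):
    """Masked reconstruction error plus an L2 penalty that keeps z from drifting too far."""
    return ((mask * img - mask * sketch) ** 2).sum() + lam_reg * (z ** 2).sum()
```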

Varying the weight of our regularizer yields the following results:

λ_reg = 0.1 seems to be the sweet spot before the regularizer becomes overpowering. Once we reach λ_reg = 1, we again fall off the manifold of real(ish) cat images. If our regularizer is weighted too heavily, the optimal latent variables will eventually be all zeros, so we have to give the optimizer room to actually optimize our image loss.

Below are side-by-sides of the sketches provided by the 16-726 course staff and their projections onto the cat manifold (or the cat-ifold):

Because of the masking technique, our projection method is able to handle both dense and sparse sketches. While there are only so many ways to depict the same cat, our optimization method still produced outputs that resemble the sketches. If we consider that the output of the generator is constrained to the set of grumpy cat images, we can only expect so much of a resemblance to the arbitrary sketches.

And now for a few of my own works of art:

It seems like our sketch2cat algorithm is robust enough to handle even my sketches. One interesting thing to note is that for the more, erm, implausible sketches, the result seems lower contrast or more washed out. This indicates that the optimization is struggling to match the sketch while still producing a realistic image.

Thanks for Reading!

If you’d like to know more about how StyleGAN works or see some more of its applications, you can check out NVIDIA’s demo video here:

:)