Blending StyleGAN2 models to turn faces into cats (and more)

Project Website for 16726 - Learning Based Image synthesis

CMU Spring 2021


Tarang Shah (tarangs)

Rohan Rao(rgrao)


🚀
Demo Website

Goal of the project

Projector upgrades to StyleGAN2

The original code optimizes a single latent vector produced by StyleGAN's internal mapping network - $w$. However, inspired by HW5, we modify this to optimize a collection of per-layer latent vectors instead - $w+$ (technically a latent "tensor", but we refer to it as a latent vector for brevity).

We can see the mapping network in the image above, which maps a random vector $z$ to the intermediate latent space $W$.
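As a rough illustration, here is a minimal PyTorch sketch of the difference between optimizing $w$ and $w+$. It assumes a generator `G` loaded from the stylegan2-ada-pytorch codebase (attribute names such as `G.z_dim`, `G.num_ws`, `G.mapping`, and `G.synthesis` follow that repository) and is a sketch, not our exact projector code:

```python
import torch

# Sketch only: assumes G is a loaded stylegan2-ada-pytorch generator.
z = torch.randn(1, G.z_dim)                        # random latent in Z space
w = G.mapping(z, None)                             # shape [1, G.num_ws, 512], broadcast over layers

# "w" projection: optimize a single 512-d vector, tiled across all synthesis layers.
w_single = w[:, :1, :].detach().clone().requires_grad_(True)
img_w = G.synthesis(w_single.repeat(1, G.num_ws, 1))

# "w+" projection: optimize one 512-d vector per synthesis layer independently.
w_plus = w.detach().clone().requires_grad_(True)
img_wplus = G.synthesis(w_plus)
```

Optimizing all `G.num_ws` vectors independently gives the projector many more degrees of freedom to match a target image.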

We also tried adding a Mean Squared Error (MSE) loss, but we noted that this MSE loss was actually making the resulting images smoother and less photorealistic, which reduced the perceptual quality. We also tried varying the noise regularization parameter, but this too did not result in any significant changes in the generated outputs.

In the future, we would like to add a feed-forward neural network to provide a quick one-shot initialization for the optimization.

Training StyleGAN2

  1. Start with a pretrained FFHQ-based face generator model - here we used the 256x256 version from NVIDIA for faster experimentation. We also used the version with the latest ADA improvements, which allow us to train a StyleGAN model with significantly fewer images (see the loading sketch after this list).
  1. Fine-tune the model on 4 datasets - AFHQ Cats, AFHQ Dogs, AFHQ Wild Animals, and Google Cartoons
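For reference, here is a minimal sketch of how the pretrained checkpoint can be loaded with the loader utilities that ship with the stylegan2-ada-pytorch codebase. The path below is a placeholder; the fine-tuning itself is driven by that repository's `train.py`, resuming from this checkpoint.

```python
import dnnlib
import legacy  # both modules come from the stylegan2-ada-pytorch codebase

# Placeholder path: a 256x256 FFHQ checkpoint downloaded from NVIDIA.
network_pkl = 'pretrained/ffhq-res256.pkl'

with dnnlib.util.open_url(network_pkl) as f:
    data = legacy.load_network_pkl(f)

G = data['G_ema'].cuda()  # generator with exponential-moving-average weights, used for inference
```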

Inverting the Generator

Background

The core idea here is to invert an image generator, in our case the StyleGAN2 generator. Before we get into the details of the inversion and how we do it, let's first understand what a generator does. An image generator takes a vector as input and produces an image - essentially, it is a black box that takes a vector and returns an image.

Vector → [Generator] → Image
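In code, this black-box view is just a single forward pass (a sketch, assuming a StyleGAN2 generator `G` loaded as shown earlier):

```python
import torch

# Vector in, image out.
z = torch.randn(1, G.z_dim)   # random 512-d input vector
img = G(z, None)              # e.g. a [1, 3, 256, 256] tensor with values roughly in [-1, 1]
```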

Below, we show a visualization of the generator model from our "Vanilla GAN" of Homework 3.

Generator from the Vanilla GAN

Although we show the generator from a simple GAN, it is possible to use any generator. For this project, we use the popular StyleGAN2 generator.

The task

Now that we have seen what a generator does, let's talk about our task. Our first task is to recover a vector from a given input image - literally the opposite of what the generator does 🙃.

The vector we want from a given image is also known as a latent vector, since it belongs to the "latent space" of the generator.

Given an input image, the goal is to find a latent vector that produces the input image when we pass it through the generator.

Input Image → [ ?? ] → Latent Vector → [Generator] → Generated Image

Our goal is to figure out the "??" in the above process, such that the Generated Image is as similar to the input image as possible. Since the process is the reverse of what the Generator does, we call this "inverting" the generator.

We use optimization techniques to achieve this inversion. We don't actually train a model to replace the "??" in the process above; instead, we use math and optimization to achieve the results we want.

Steps followed

  1. We start with a random Latent Vector and then pass it through the Generator.
  1. The Generator is in eval mode, so it is only used for a forward pass
  1. Since we want the resultant image from the Generator to be as close to the real image as possible, we need to build a loss function that captures this
    1. We use a combination of a simple Mean-Squared-Error loss and a Perceptual Loss to achieve this. This is a weighted combination as mentioned in this paper
      1. $Metric = (1-\lambda)\times mse\_loss + \lambda \times perceptual\_loss$ (here $\lambda$ is also called the perceptual weight or perc_wgt)
      1. The Perceptual Loss here is the "Content Loss" at conv_4 of a VGG network, as described here
    1. The loss is calculated between the resultant image and the input real image
      1. $Loss = \sqrt{[Metric(RealImage)-Metric(GeneratedImage)]^2}$
  1. We use the Adam optimizer on this loss to optimize the input Latent Vector (a change from the LBFGS optimizer used in HW5)
  1. Finally, after about 500 iterations, we take the resultant vector as the optimized latent (a condensed sketch of this loop follows below)
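Putting these steps together, here is a condensed sketch of the projection loop. It is not our exact code: it assumes `G` is a loaded, frozen StyleGAN2 generator, `target` is the real image as a `[1, 3, H, W]` tensor in `[-1, 1]`, and `perc` is a VGG-based perceptual (content) loss module.

```python
import torch
import torch.nn.functional as F

perc_wgt = 0.5    # lambda, the perceptual weight
num_steps = 500

# Initialize the latent from a random sample of the mapping network (w+ has one vector per layer).
with torch.no_grad():
    w_init = G.mapping(torch.randn(1, G.z_dim, device=target.device), None)
w_plus = w_init.clone().requires_grad_(True)

optimizer = torch.optim.Adam([w_plus], lr=0.01)

for step in range(num_steps):
    optimizer.zero_grad()
    synth = G.synthesis(w_plus)                      # forward pass only; G stays frozen
    mse_loss = F.mse_loss(synth, target)
    perceptual_loss = perc(synth, target)            # content loss at conv_4 of a VGG network
    loss = (1 - perc_wgt) * mse_loss + perc_wgt * perceptual_loss
    loss.backward()
    optimizer.step()

# w_plus now holds the optimized latent for `target`.
```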

Model Blending using StyleGAN2

Background and Task Description

Here, we would like to blend two trained StyleGAN2 models together. The first model, called the base model, will be trained on a particular dataset, like FFHQ. We already have pre-trained FFHQ models available, thanks to NVIDIA. These models are then fine-tuned on specific datasets, like the AFHQ-Cats, AFHQ-Dogs, AFHQ-Wild Animals and Google Cartoons type of datasets.

We then use two trained models and swap out the corresponding layers using either binary or fractional blending techniques. This is described in the figure below.

As shown above, we can create a model that uses some weights from the base model and some weights from the second model, and then use it to generate new and interesting results. Here we have two options - either switch between the two sets of weights abruptly, or use a fractional linear combination to transition smoothly between them.

When the blending layer is $B_K$, then for each block at or above K, we perform the following operation on each layer:

$L_M = (1-\alpha) L_A + \alpha L_B$

For the simple case, $\alpha = 1$.

Alternatively, we also use a progressive $\alpha$ value,

$\alpha = \frac{1}{1+e^{-m/q}}$

The main configurable parameter for the fractional $\alpha$ is the blend width $q$. We found the best results at $q = 0.7$ (a code sketch of this blending procedure is given after the definitions below).

Where,

$L_M$ = weights of the resultant blended model

$L_A$ = weights from the base model layer

$L_B$ = weights from the second model layer

$\alpha$ = weight factor

$K$ = block after which we start blending

$m$ = index of the block, starting at 0 for the Kth block (for example, if K is 8, B16 has m = 1 and B32 has m = 2)

$q$ = blend width
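As a concrete illustration, below is a sketch of how such a blend can be computed directly on two generators' state dicts. It assumes both models come from the stylegan2-ada-pytorch codebase, where synthesis blocks are named `synthesis.b4`, `synthesis.b8`, and so on; the exact key handling in our repository may differ.

```python
import copy
import math
import re

def blend_models(G_base, G_new, K=8, q=0.7):
    """Blend G_new into G_base: blocks below resolution K keep the base weights,
    blocks from K upward are linearly mixed with a progressively increasing alpha."""
    G_blend = copy.deepcopy(G_base)
    sd_blend = G_blend.state_dict()
    sd_new = G_new.state_dict()

    for name, w_base in sd_blend.items():
        match = re.match(r'synthesis\.b(\d+)\.', name)
        if match is None:
            continue                                   # leave the mapping network untouched
        res = int(match.group(1))
        if res < K:
            continue                                   # below the blend point: keep base weights
        m = int(math.log2(res) - math.log2(K))         # m = 0 for bK, 1 for b(2K), ...
        alpha = 1.0 / (1.0 + math.exp(-m / q))         # progressive alpha
        sd_blend[name] = (1 - alpha) * w_base + alpha * sd_new[name]

    G_blend.load_state_dict(sd_blend)
    return G_blend

# Example: dog generator as the base, face generator blended in from B8 upward.
# G_blended = blend_models(G_dog, G_face, K=8, q=0.7)
```

For the simple hard-swap case, $\alpha$ would be set to 1 for every blended block instead of following the sigmoid schedule.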

Note: for best results, the second model MUST be transfer-learned from the base model. Here we show the difference with and without transfer learning:

From this experiment, it seems that when one model is transfer-learned from the other, many features in the early layers of the two models remain related. Hence, when we blend or even swap layers across these models, we get interesting combinations.

Blending Model Weights vs Latent Space Interpolation

Latent Space Interpolation vs. Blending Weights

| Name | Latent Space Interpolation | Blending Weights |
| --- | --- | --- |
| Type of operation | Single-model operation | Multiple models - Humans, Cats, Dogs, Wild Animals, Cartoons! |
| Number of operations | Allows for multiple operations within the same model | One-time operation - blend the weights and save the blended model |
| Ease of usage with new inputs | Need to find a latent vector for each input (requires optimization or a forward pass through a CNN) | If we have the base model latent vectors, they do not need to be recalculated per model |
| How it works | Need to tweak the latent vector to obtain the required styles | Easier to apply the same style to a variety of inputs |

Experiments and Interesting Results

Simple Blending Experiments (with random seeds)

In the first row, the Dog generator is the base model and the face generator is the blended (second) model. As we blend at higher and higher layers, we see less of the blended model (faces).

We also see some interesting things happening here:

  1. In the first row, the B4 column shows that the FFHQ network used the corresponding latent vector to generate a person with a very similar face pose, background, and angle.
  1. For the rest of the rows, B16 and B32 generate the most exciting blends, forming the "uncanny valley" between human faces and cat faces.

More Interesting Blending Experiments

The middle section includes the best layers: between B16 and B64, we clearly see a mix of the two models, but the cuteness is still preserved 😍

High-level Procedure

Results on custom images

Above, we see a sequence of progressive optimization of the latent vector. In each group of three images, the first is the original image, the second is the synthesized image, and the third is the blended image.

We show results of blending Cat models onto Face models at different levels. Column 1 is the original Generator output, Column 2 is B4, Column 3 is B8, and so on.

Github Repository

The code for this repository is available here - https://github.com/t27/stylegan2-blending

We started from the base StyleGAN2-ADA PyTorch repository and made various changes, including improvements to the existing code as well as additions of our own.

Links to our pretrained models and instructions to set up and run our code are also provided in the README of the repository.

We also include the source code for the webapp demo described below.

Demo Website

We have a demo website for this project available. We used the Streamlit library to develop the website.

The demo website is available here https://t27.pagekite.me/

Since the website runs on our local machine, we aim to keep it live until 25th May.

Here is a video walkthrough of the website - https://youtube.com/watch?v=Urr-bbI10DQ

Note for the instructors and TAs: if there are any issues or problems with the website, please contact us and we will do our best to fix them and keep the link up and running.

Applications

Next Steps

References