Blending StyleGAN2 models to turn faces into cats (and more)

Project Website for 16726 - Learning Based Image synthesis

CMU Spring 2021


Tarang Shah (tarangs)

Rohan Rao(rgrao)


🚀
Demo Website

Goal of the project

Projector upgrades to StyleGAN2

The original code optimizes a single latent vector produced by StyleGAN's internal mapping network - $w$. However, inspired by HW5, we modify this to optimize a collection of per-layer latent vectors instead - $w+$ (technically a latent "tensor", but we refer to it as a latent vector for brevity).

We can see the mapping network in the image above, which maps a random vector $z$ to the intermediate latent space $W$.
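As a rough illustration, here is a minimal PyTorch sketch of the difference between optimizing $w$ and $w+$. It assumes a generator `G` loaded from the stylegan2-ada-pytorch codebase (attribute names such as `G.z_dim`, `G.num_ws`, `G.mapping`, and `G.synthesis` follow that repository) and is a sketch, not our exact projector code:

```python
import torch

# Sketch only: assumes G is a loaded stylegan2-ada-pytorch generator.
z = torch.randn(1, G.z_dim)                        # random latent in Z space
w = G.mapping(z, None)                             # shape [1, G.num_ws, 512], broadcast over layers

# "w" projection: optimize a single 512-d vector, tiled across all synthesis layers.
w_single = w[:, :1, :].detach().clone().requires_grad_(True)
img_w = G.synthesis(w_single.repeat(1, G.num_ws, 1))

# "w+" projection: optimize one 512-d vector per synthesis layer independently.
w_plus = w.detach().clone().requires_grad_(True)
img_wplus = G.synthesis(w_plus)
```

Optimizing all `G.num_ws` vectors independently gives the projector many more degrees of freedom to match a target image.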

We also tried adding a Mean Squared Error (MSE) loss, but we noted that this MSE loss was actually making the resulting images smoother and less photorealistic, which reduced the perceptual quality. We also tried varying the noise regularization parameter, but this too did not result in any significant changes in the generated outputs.

In the future, we would like to add a feed-forward neural network to provide a quick one-shot initialization for the optimization.

Training StyleGAN2

  1. Start with a pretrained FFHQ-based face generator model - here we used the 256x256 version from NVIDIA for faster experimentation. We also used the version with the latest ADA improvements, which allow us to train a StyleGAN model with significantly fewer images (see the loading sketch after this list).
  1. Fine-tune the model on 4 datasets - AFHQ Cats, AFHQ Dogs, AFHQ Wild Animals, and Google Cartoons
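For reference, here is a minimal sketch of how the pretrained checkpoint can be loaded with the loader utilities that ship with the stylegan2-ada-pytorch codebase. The path below is a placeholder; the fine-tuning itself is driven by that repository's `train.py`, resuming from this checkpoint.

```python
import dnnlib
import legacy  # both modules come from the stylegan2-ada-pytorch codebase

# Placeholder path: a 256x256 FFHQ checkpoint downloaded from NVIDIA.
network_pkl = 'pretrained/ffhq-res256.pkl'

with dnnlib.util.open_url(network_pkl) as f:
    data = legacy.load_network_pkl(f)

G = data['G_ema'].cuda()  # generator with exponential-moving-average weights, used for inference
```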

Inverting the Generator

Background

The core idea here is to invert an image generator, in our case the StyleGAN2 generator. Before we get into the details of the inversion and how we do it, let's first understand what a generator does. An image generator takes a vector as input and produces an image - essentially, it is a black box that takes a vector and returns an image.

Vector → [Generator] → Image
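In code, this black-box view is just a single forward pass (a sketch, assuming a StyleGAN2 generator `G` loaded as shown earlier):

```python
import torch

# Vector in, image out.
z = torch.randn(1, G.z_dim)   # random 512-d input vector
img = G(z, None)              # e.g. a [1, 3, 256, 256] tensor with values roughly in [-1, 1]
```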

Below, we show a visualization of the generator model from our "Vanilla GAN" of Homework 3.

Generator from the Vanilla GAN

Although we show the generator from a simple GAN, it is possible to use any generator. For this project, we use the popular StyleGAN2 generator.

The task

Now that we have seen what a generator does, let's talk about our task. Our first task is to recover a vector from a given input image - literally the opposite of what the generator does 🙃.

The vector we want from a given image is also known as a latent vector, since it belongs to the "latent space" of the generator.

Given an input image, the goal is to find a latent vector that produces the input image when we pass it through the generator.

Input Image → [ ?? ] → Latent Vector → [Generator] → Generated Image

Our goal is to figure out the "??" in the above process, such that the Generated Image is as similar to the input image as possible. Since the process is the reverse of what the Generator does, we call this "inverting" the generator.

We use optimization techniques to achieve this inversion. We don't actually train a model to replace the "??" in the process above; instead, we use math and optimization to achieve the results we want.

Steps followed

  1. We start with a random Latent Vector and then pass it through the Generator.
  1. The Generator is in eval mode, so it is only used for a forward pass
  1. Since we want the resultant image from the Generator to be as close to the real image as possible, we need to build a loss function that captures this
    1. We use a combination of a simple Mean-Squared-Error loss and a Perceptual Loss to achieve this. This is a weighted combination as mentioned in this paper
      1. $Metric = (1-\lambda)\times mse\_loss + \lambda \times perceptual\_loss$ (here $\lambda$ is also called the perceptual weight or perc_wgt)
      1. The Perceptual Loss here is the "Content Loss" at conv_4 of a VGG network, as described here
    1. The loss is calculated between the resultant image and the input real image
      1. $Loss = \sqrt{[Metric(RealImage)-Metric(GeneratedImage)]^2}$
  1. We use the Adam optimizer on this loss to optimize the input Latent Vector (a change from the LBFGS optimizer used in HW5)
  1. Finally, after about 500 iterations, we take the resultant vector as the optimized latent (a condensed sketch of this loop follows below)
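Putting these steps together, here is a condensed sketch of the projection loop. It is not our exact code: it assumes `G` is a loaded, frozen StyleGAN2 generator, `target` is the real image as a `[1, 3, H, W]` tensor in `[-1, 1]`, and `perc` is a VGG-based perceptual (content) loss module.

```python
import torch
import torch.nn.functional as F

perc_wgt = 0.5    # lambda, the perceptual weight
num_steps = 500

# Initialize the latent from a random sample of the mapping network (w+ has one vector per layer).
with torch.no_grad():
    w_init = G.mapping(torch.randn(1, G.z_dim, device=target.device), None)
w_plus = w_init.clone().requires_grad_(True)

optimizer = torch.optim.Adam([w_plus], lr=0.01)

for step in range(num_steps):
    optimizer.zero_grad()
    synth = G.synthesis(w_plus)                      # forward pass only; G stays frozen
    mse_loss = F.mse_loss(synth, target)
    perceptual_loss = perc(synth, target)            # content loss at conv_4 of a VGG network
    loss = (1 - perc_wgt) * mse_loss + perc_wgt * perceptual_loss
    loss.backward()
    optimizer.step()

# w_plus now holds the optimized latent for `target`.
```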

Model Blending using StyleGAN2

Background and Task Description

Here, we would like to blend two trained StyleGAN2 models together. The first model, called the base model, will be trained on a particular dataset, like FFHQ. We already have pre-trained FFHQ models available, thanks to NVIDIA. These models are then fine-tuned on specific datasets, like the AFHQ-Cats, AFHQ-Dogs, AFHQ-Wild Animals and Google Cartoons type of datasets.

We then use two trained models and swap out the corresponding layers using either binary or fractional blending techniques. This is described in the figure below.

As shown above, we can create a model that uses some weights from the base model and some weights from the second model, and then use it to generate new and interesting results. Here we have two options - either switch between the two sets of weights abruptly, or use a fractional linear combination to transition smoothly between them.

When the blending layer is $B_K$, then for each block at or above K, we perform the following operation on each layer:

$L_M = (1-\alpha) L_A + \alpha L_B$

For the simple case, $\alpha = 1$.

Alternatively, we also use a progressive $\alpha$ value,

$\alpha = \frac{1}{1+e^{-m/q}}$

The main configurable parameter for the fractional $\alpha$ is the blend width $q$. We found the best results at $q = 0.7$ (a code sketch of this blending procedure is given after the definitions below).

Where,

$L_M$ = weights of the resultant blended model

$L_A$ = weights from the base model layer

$L_B$ = weights from the second model layer

$\alpha$ = weight factor

$K$ = block after which we start blending

$m$ = index of the block, starting at 0 for the Kth block (for example, if K is 8, B16 has m = 1 and B32 has m = 2)

$q$ = blend width
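As a concrete illustration, below is a sketch of how such a blend can be computed directly on two generators' state dicts. It assumes both models come from the stylegan2-ada-pytorch codebase, where synthesis blocks are named `synthesis.b4`, `synthesis.b8`, and so on; the exact key handling in our repository may differ.

```python
import copy
import math
import re

def blend_models(G_base, G_new, K=8, q=0.7):
    """Blend G_new into G_base: blocks below resolution K keep the base weights,
    blocks from K upward are linearly mixed with a progressively increasing alpha."""
    G_blend = copy.deepcopy(G_base)
    sd_blend = G_blend.state_dict()
    sd_new = G_new.state_dict()

    for name, w_base in sd_blend.items():
        match = re.match(r'synthesis\.b(\d+)\.', name)
        if match is None:
            continue                                   # leave the mapping network untouched
        res = int(match.group(1))
        if res < K:
            continue                                   # below the blend point: keep base weights
        m = int(math.log2(res) - math.log2(K))         # m = 0 for bK, 1 for b(2K), ...
        alpha = 1.0 / (1.0 + math.exp(-m / q))         # progressive alpha
        sd_blend[name] = (1 - alpha) * w_base + alpha * sd_new[name]

    G_blend.load_state_dict(sd_blend)
    return G_blend

# Example: dog generator as the base, face generator blended in from B8 upward.
# G_blended = blend_models(G_dog, G_face, K=8, q=0.7)
```

For the simple hard-swap case, $\alpha$ would be set to 1 for every blended block instead of following the sigmoid schedule.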

Note: for best results, the second model MUST be transfer-learned from the base model. Here we show the difference with and without transfer learning:

From this experiment, it seems that when one model is transfer-learned from the other, many features in the early layers of the two models remain related. Hence, when we blend or even swap layers across these models, we get interesting combinations.

Blending Model Weights vs Latent Space Interpolation

Latent Space Interpolation vs. Blending Weights

| Name | Latent Space Interpolation | Blending Weights |
| --- | --- | --- |
| Type of operation | Single-model operation | Multiple models - Humans, Cats, Dogs, Wild Animals, Cartoons! |
| Number of operations | Allows for multiple operations within the same model | One-time operation - blend the weights and save the blended model |
| Ease of usage with new inputs | Need to find a latent vector for each input (requires optimization or a forward pass through a CNN) | If we have the base model latent vectors, they do not need to be recalculated per model |
| How it works | Need to tweak the latent vector to obtain the required styles | Easier to apply the same style to a variety of inputs |

Experiments and Interesting Results

Simple Blending Experiments (with random seeds)

In the first row, the Dog generator is the base model and the face generator is the blended (second) model. As we blend at higher and higher layers, we see less of the blended model (faces).

We also see some interesting things happening here:

  1. In the first row, the B4 column shows that the FFHQ network used the corresponding latent vector to generate a person with a very similar face pose, background, and angle.
  1. For the rest of the rows, B16 and B32 generate the most exciting blends, forming the "uncanny valley" between human faces and cat faces.

More Interesting Blending Experiments

The middle section includes the best layers: between B16 and B64, we clearly see a mix of the two models, but the cuteness is still preserved 😍

High-level Procedure

Results on custom images

Above, we see a sequence of progressive optimization of the latent vector. In each group of three images, the first is the original image, the second is the synthesized image, and the third is the blended image.

We show results of blending Cat models onto Face models at different levels. Column 1 is the original Generator output, Column 2 is B4, Column 3 is B8, and so on.

Github Repository

The code for this repository is available here - https://github.com/t27/stylegan2-blending

We started from the base StyleGAN2-ADA PyTorch repository and made various changes, including improvements to the existing code as well as additions of our own.

Links to our pretrained models and instructions to set up and run our code are also provided in the README of the repository.

We also include the source code for the webapp demo described below.

Demo Website

We have a demo website for this project available. We used the Streamlit library to develop the website.

The demo website is available here https://t27.pagekite.me/

Since the website runs on our local machine, we aim to keep it live until 25th May.

Here is a video walkthrough of the website - https://youtube.com/watch?v=Urr-bbI10DQ

Note for the instructors and TAs: if there are any issues or problems with the website, please contact us and we will do our best to fix them and keep the link up and running.

Applications

Next Steps

References