Assignment 5

GAN Photo Editing

Problem Description

In this assignment, the focus is on image editing with a GAN as the backbone. The goal is to edit an image in a way that is semantically meaningful. The assignment is divided into two parts. The first part is to obtain the latent code of images with a pre-trained GAN model. The second part is to apply edits to the latent code and generate the corresponding images.

Part 1: Inverting the Generator

Loss functions

Three loss functions are implemented in this assignment. The first is the perceptual loss, carried over from the previous assignment with some minor changes. It uses the VGG19 network to compute a content loss and a style loss between the generated image and the target image.
The second and third loss functions are p-norm losses; in particular, an L1 loss and an L2 loss are implemented.
Whether each of these three losses is active, and how much it contributes, is controlled by a corresponding weight hyperparameter.
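Below is a minimal sketch of how these losses could be assembled. The VGG19 layer indices, the content-only perceptual term (the style term is omitted for brevity), and the default weights are illustrative assumptions, not the exact assignment code.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Frozen VGG19 feature extractor; gradients still flow back to the input image.
vgg = models.vgg19(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def perceptual_loss(pred, target, layers=(3, 8, 17)):
    """Content part of the perceptual loss: MSE between VGG19 feature maps.
    The layer indices are illustrative; the style term is omitted here."""
    loss, x, y = 0.0, pred, target
    for i, layer in enumerate(vgg):
        x, y = layer(x), layer(y)
        if i in layers:
            loss = loss + F.mse_loss(x, y)
        if i >= max(layers):
            break
    return loss

def combined_loss(pred, target, w_perc=1.0, w_l1=1.0, w_l2=1.0):
    """Weighted sum of the three losses; a weight of 0 disables a term."""
    return (w_perc * perceptual_loss(pred, target)
            + w_l1 * F.l1_loss(pred, target)    # L1 (p=1) norm loss
            + w_l2 * F.mse_loss(pred, target))  # L2 (p=2) norm loss
```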

Models

Two model architectures are involved in this assignment: a DC-GAN and a StyleGAN. Since the latent spaces differ between the two models, a corresponding noise sampling method is implemented for each. For the DC-GAN, the noise is sampled from a Gaussian distribution. For the StyleGAN, the latent vectors are produced by the mapping network, which takes Gaussian noise as input.
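A hedged sketch of the sampling logic: the `generator.mapping` attribute and the latent dimension of 512 are assumptions about the pretrained checkpoints, and the w+ code is formed by repeating w once per synthesis layer.

```python
import torch

def sample_latent(mode, generator, n=1, latent_dim=512, n_layers=None, device="cuda"):
    """Sample an initial latent code for a given model / latent space.
    `generator.mapping` and `latent_dim` are assumed checkpoint details."""
    z = torch.randn(n, latent_dim, device=device)  # Gaussian noise
    if mode in ("dcgan", "z"):
        return z                                   # DC-GAN and StyleGAN z space use z directly
    w = generator.mapping(z)                       # StyleGAN mapping network: z -> w
    if mode == "w+":
        # Repeat w for every synthesis layer so each layer's style can be
        # optimized independently during inversion.
        w = w.unsqueeze(1).repeat(1, n_layers, 1)
    return w
```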

Results

To find the best combination of model and hyperparameters, I conducted an exhaustive grid search over all options. Below I isolate and present the effect of each component in a controlled manner.

Loss Compositions

To illustrate the effect of each combination of loss functions, below are the results on StyleGAN with w as the latent space.

Original reference cat
Using only perceptual loss
Using only L1 loss
Using only L2 loss
Using perceptual loss and l1 loss
Using perceptual loss and l2 loss
Using all three losses
Using all losses with regularization

The results above are obtained after fine-tuning hyperparameters such as the weight of each loss function, the regularization strength, and the training length. Each setting is run 5 times and the best result is selected, to emphasize the effect of each loss function while mitigating the randomness of training.

From the results above, we can see that:
  • Using only the perceptual loss results in pixel-wise intensity mismatches and is undesirable. However, the perceptual loss does let the latent vector capture content and style, such as the color blobs in the background.
  • Using only a p-norm loss (L1 or L2) reconstructs the main object fairly well. However, the surrounding background is poorly reconstructed and some content details are lost. Between the two, reconstruction with L1 loss is smoother and reconstruction with L2 loss is sharper.
  • Using the perceptual loss together with a p-norm loss generates images that are more similar to the reference image in all aspects, and using all three loss functions achieves the best result in my experiments.
  • While the difference is not obvious, adding a regularization loss keeps the delta in the latent vector from becoming too extreme and prevents overfitting, while maintaining high reconstruction quality.

Based on the experiments above, I select the combination of perceptual loss with a weight of 0.001, L1 loss with a weight of 10, L2 loss with a weight of 10, and regularization loss with a weight of 0.01 as the final hyperparameters for the following experiments.
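Wiring the selected weights into the full objective might look like the sketch below; the squared-delta form of the regularizer is my assumption about how the regularization loss is defined.

```python
# Final weights selected above.
WEIGHTS = {"perc": 0.001, "l1": 10.0, "l2": 10.0, "reg": 0.01}

def objective(pred, target, latent, latent_init):
    """Full objective: weighted losses plus a penalty on the latent delta.
    The squared-delta regularizer is an assumed form."""
    reg = (latent - latent_init).pow(2).mean()  # keep the delta small
    return (combined_loss(pred, target,
                          w_perc=WEIGHTS["perc"],
                          w_l1=WEIGHTS["l1"],
                          w_l2=WEIGHTS["l2"])
            + WEIGHTS["reg"] * reg)
```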

As for the speed of the above methods:
  • perceptual loss only: 46.73s
  • L1 loss only: 46.45s
  • L2 loss only: 47.44s
  • perceptual loss + L1 loss: 43.57s
  • perceptual loss + L2 loss: 42.74s
  • perceptual loss + L1 loss + L2 loss: 48.63s
  • perceptual loss + L1 loss + L2 loss + regularization loss: 42.28s

Model and latent space

To illustrate the effect of the model and latent space, below are the results obtained with the hyperparameters found above.

DC-GAN
StyleGAN with z space
StyleGAN with w space
StyleGAN with w+ space

Using the discovered hyperparameters, each model and latent space is run 5 times and the best result is selected. From the results above, we can see that:

  • DC-GAN is not able to reconstruct the image well. It fails to capture the details of the image, and the reconstruction is blurry and pixelated. This is because the latent space z is not expressive enough to capture the content and style of the image.
  • StyleGAN with the z space reconstructs the image better. However, for a similar reason, the reconstructed image is not as sharp and some details are still missing; for example, the eyes should be wider and more grumpy in the reconstructed image.
  • StyleGAN with the w space reconstructs the image better still. The reconstructed image is sharper and more details are captured.
  • StyleGAN with the w+ space reconstructs the image best. Not only is the reconstructed image sharper with more details captured, it is also more similar to the reference image in terms of color and style, e.g. the color of the background.

From the above experiments, we can see that the latent space matters greatly for reconstruction: it must be expressive enough to capture the content and style of the image. StyleGAN with the w+ space is the best choice for reconstruction.
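For reference, the inversion itself can be sketched as a simple first-order optimization over the latent code. Adam, the step count, and the `generator(latent)` call signature are assumptions for illustration, not necessarily what my runs used.

```python
def invert(generator, target, latent_init, steps=1000, lr=0.01):
    """Optimize a latent code so the generated image matches `target`."""
    latent = latent_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([latent], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        pred = generator(latent)  # assumed generator call signature
        loss = objective(pred, target, latent, latent_init)
        loss.backward()
        opt.step()
    return latent.detach()
```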

As for the speed of the above methods:
  • DC-GAN: 24.22s
  • StyleGAN with z space: 43.87s
  • StyleGAN with w space: 42.28s
  • StyleGAN with w+ space: 42.62s

Part 2: Interpolate your Cats

This section explores interpolation by generating convex linear combinations of two inverted latent vectors. The hyperparameters are the best set found in the previous section.
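The interpolation itself reduces to a convex combination of the two latents; the sketch below assumes the `generator` and inverted latents from Part 1.

```python
def interpolate(generator, latent_a, latent_b, n_frames=30):
    """Generate frames along the line segment between two latent codes."""
    frames = []
    for t in torch.linspace(0.0, 1.0, n_frames):
        latent = (1 - t) * latent_a + t * latent_b  # convex combination
        with torch.no_grad():
            frames.append(generator(latent))        # one gif frame per step
    return frames
```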

Results

DC-GAN

Image 1
Interpolation gif
Image 2

StyleGAN with z space

Image 1
Interpolation gif
Image 2

StyleGAN with w space

Image 1
Interpolation gif
Image 2

StyleGAN with w+ space

Image 1
Interpolation gif
Image 2

Another Example ...

DC-GAN

Image 1
Interpolation gif
Image 2

StyleGAN with z space

Image 1
Interpolation gif
Image 2

StyleGAN with w space

Image 1
Interpolation gif
Image 2

StyleGAN with w+ space

Image 1
Interpolation gif
Image 2

In the first example, because the two cats have different face orientations, the interpolation shows the cat gradually turning its head. In the second example, the color themes of the two images differ, and the interpolation shows the overall color shifting from pink to white.

Limited by the expressiveness of the z space, the DC-GAN interpolation does not even start from well-reconstructed endpoint images.

For similar reasons, the StyleGAN interpolation in the z space has better, but still imperfect, endpoint images. Moreover, in the first example the interpolation seems to pass through an intermediate cat face before reaching the ending cat face.

StyleGAN with the w and w+ spaces has better endpoint images and interpolations. However, the interpolation in the w space is not as smooth: the w+ space interpolation is smoother and more linear between the two images.

Part 3: Scribble to Image

This part generates images from scribbles by optimizing the latent code, initialized to the mean of the latent space, to reconstruct only the unmasked (scribble) region of the given image. Below are some results.
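A minimal sketch of the masked objective, assuming a binary mask of shape (1, 1, H, W) that is 1 on the stroke region; masking by multiplication before computing the losses is my assumption about the implementation.

```python
def masked_objective(pred, target, mask, latent, latent_init):
    """Match only the scribble region: zero out everything outside the mask
    before computing the losses, then add the latent-delta regularizer."""
    pred_m, target_m = pred * mask, target * mask
    reg = (latent - latent_init).pow(2).mean()
    return (combined_loss(pred_m, target_m,
                          w_perc=WEIGHTS["perc"],
                          w_l1=WEIGHTS["l1"],
                          w_l2=WEIGHTS["l2"])
            + WEIGHTS["reg"] * reg)
```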

Provided Sketches

Sketch 1
Sketch 2
Sketch 3
Sketch 4
Mask 1
Mask 2
Mask 3
Mask 4
DC-GAN
DC-GAN
DC-GAN
DC-GAN
StyleGAN with z space
StyleGAN with z space
StyleGAN with z space
StyleGAN with z space
StyleGAN with w space
StyleGAN with w space
StyleGAN with w space
StyleGAN with w space
StyleGAN with w+ space
StyleGAN with w+ space
StyleGAN with w+ space
StyleGAN with w+ space

More Provided Sketches

Sketch 5
Sketch 6
Sketch 7
Sketch 8
Mask 5
Mask 6
Mask 7
Mask 8
DC-GAN
DC-GAN
DC-GAN
DC-GAN
StyleGAN with z space
StyleGAN with z space
StyleGAN with z space
StyleGAN with z space
StyleGAN with w space
StyleGAN with w space
StyleGAN with w space
StyleGAN with w space
StyleGAN with w+ space
StyleGAN with w+ space
StyleGAN with w+ space
StyleGAN with w+ space

It can be observed that StyleGAN with the w+ space consistently performs the best. However, how sparse or dense the sketch is plays a very important role in the quality of the generated image.

In the first 4 sketches, the images generated by StyleGAN with the w+ space correctly capture the key features and properties of the sketches. The eye positions and colors mostly match, the fur color distributions are very close to the originals, and the overall face structure is well preserved.

However, in the last 4 sketches, particularly the densely drawn ones (sketches 6 and 8 have the whole face region painted), the w and w+ space StyleGANs fail to generate realistic images. One potential reason is the significant domain gap between the target image and the training dataset: since the StyleGAN was not trained on the sketch domain, it is hard to reduce the loss on dense sketches, making the learning process ineffective. In other words, when the sketch is sparse there is more freedom, as the loss only matches the regions with strokes, so we are able to train latent vectors that effectively reduce the loss, and the generated images are more realistic.

My Own Sketches

Above, we explored the effect of sparse and dense sketches on the quality of the generated images. However, most of the sketches above have similar color distributions (e.g. eye color and fur color). To further explore the effect of color on the generated images, the experiments below use my own sketches.

Sketch 9
Sketch 10
Sketch 11
Mask 9
Mask 10
Mask 11
StyleGAN with w space
StyleGAN with w space
StyleGAN with w space
StyleGAN with w+ space
StyleGAN with w+ space
StyleGAN with w+ space

From the above results, we can see that color does have a significant impact on the quality of the generated images. While the fur color is captured to some extent, the eye color is not well preserved. The generated images remain strongly biased towards the training dataset (the grumpy cat color distribution).