Assignment 5

16-726 | Qin Han | qinh@andrew.cmu.edu

Introduction

In this project, the objective is to develop GAN-based photo editing techniques that manipulate images on the natural image manifold. (1) I use an existing pre-trained generator for GAN inversion: I recover a latent code that accurately recreates a given real image, modify this latent code, and then project the adjusted code back into the image domain. (2) I transform a hand-drawn sketch into a corresponding realistic image using latent code interpolation. (3) I employ a pre-trained Stable Diffusion model to apply a forward diffusion process to the input image, transforming it into noise, which is then iteratively denoised to generate a realistic image. Together, these parts cover GAN inversion, latent code interpolation, and sketch-to-image generation.

Part 1: Inverting the Generator

In this part, I reconstruct an image from a specific latent code: I use the L-BFGS solver to optimize the latent code so that a pre-trained GAN generator reproduces the target image. The runtime is around 10 seconds for the vanilla GAN and around 25 seconds for StyleGAN.
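
A minimal sketch of the inversion loop, assuming a pre-trained generator G that maps a latent code to an image and a criterion implementing the perceptual + L1 objective described in the next subsection (the names here are placeholders, not the exact variables in my code):

    import torch

    def invert(G, target, criterion, latent_dim=512, n_steps=250, device="cuda"):
        # Start from a random latent code and optimize it so that G(latent) matches the target image.
        latent = torch.randn(1, latent_dim, device=device, requires_grad=True)
        optimizer = torch.optim.LBFGS([latent], lr=1.0, max_iter=n_steps)

        def closure():
            optimizer.zero_grad()
            generated = G(latent)                # image with the same size as `target`
            loss = criterion(generated, target)  # weighted perceptual + L1 loss
            loss.backward()
            return loss

        optimizer.step(closure)                  # L-BFGS re-evaluates the closure during its line searches
        return latent.detach(), G(latent).detach()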

Ablation on loss weight

I guide the optimization with a perceptual loss and an L1 loss. The perceptual loss measures content differences between two images as feature distances at a chosen network layer; here I use the conv_5 layer of a pre-trained VGG-19 network to compare the generated and target images. I also carry out a series of ablation studies on the combination of the two losses, fixing the L1 weight at 10 while varying the weight of the perceptual loss. All of these experiments use StyleGAN and operate in the w latent space.
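
A sketch of how the combined objective could be built with torchvision's VGG-19; the layer index and the default weights are assumptions meant to mirror the conv_5 feature and the weights used in this section:

    import torch
    import torch.nn as nn
    import torchvision.models as models

    class PerceptualL1Loss(nn.Module):
        """Weighted sum of a VGG-19 feature (perceptual) loss and a pixel-wise L1 loss."""
        def __init__(self, perc_weight=0.01, l1_weight=10.0, layer_idx=28):
            super().__init__()
            vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
            # Keep layers up to the chosen convolution; layer_idx is an assumption and
            # should be set to whichever index corresponds to conv_5 in your numbering.
            self.features = nn.Sequential(*list(vgg.children())[:layer_idx + 1])
            for p in self.features.parameters():
                p.requires_grad_(False)
            self.perc_weight = perc_weight
            self.l1_weight = l1_weight

        def forward(self, pred, target):
            perc = nn.functional.mse_loss(self.features(pred), self.features(target))
            pixel = nn.functional.l1_loss(pred, target)
            return self.perc_weight * perc + self.l1_weight * pixel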

[Figure: each row shows a target image and its StyleGAN reconstructions with perc_weight = 0.0001, 0.001, 0.01, 0.1, and 1 (L1 weight fixed at 10).]

Ablation on generator

I also investigate the impact of using different generators on the optimization process. I compare the results of using StyleGAN and vanilla GAN as the generator, with the perceptual loss weight set to 0.01 and the L1 loss weight set to 10.

[Figure: each row shows a target image and its reconstructions from StyleGAN and from the vanilla GAN.]

From the results, we can see that StyleGAN outperforms the vanilla GAN in reconstruction quality.

Ablation on latent space

I also explore the impact of the latent space on the optimization process. I compare the results of using the latent spaces z, w, and w+ for StyleGAN, with the perceptual loss weight set to 0.01 and the L1 loss weight set to 10.

[Figure: each row shows a target image and its StyleGAN reconstructions in the z, w, and w+ latent spaces.]

From the results, we can see that StyleGAN with the w+ latent space performs best in terms of reconstruction quality. This is because the w+ space allows a separate w vector to be optimized for each layer of StyleGAN, giving the reconstruction more degrees of freedom.
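
The difference between the three spaces is essentially the shape of the variable being optimized; the dimensions below are illustrative for a 512-dimensional StyleGAN with 14 style layers (the exact numbers depend on the checkpoint):

    import torch

    z_dim, w_dim, num_ws = 512, 512, 14   # illustrative StyleGAN dimensions

    latent_z  = torch.randn(1, z_dim, requires_grad=True)          # z: one Gaussian vector, fed through the mapping network
    latent_w  = torch.randn(1, w_dim, requires_grad=True)          # w: one intermediate style vector shared by all layers
    latent_wp = torch.randn(1, num_ws, w_dim, requires_grad=True)  # w+: an independent style vector per synthesis layer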

Part 2: Scribble to Image

In this part, a hand-drawn scribble guides the optimization: the scribble defines a mask, and the L1 loss is computed only inside the masked region, so the generated image is constrained to match the scribble where the user has drawn. Below are the results of the scribble-to-image task, where I use StyleGAN as the generator with the w+ latent space.
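
A minimal sketch of this masked objective (tensor names and shapes are placeholders):

    import torch

    def scribble_loss(generated, scribble, mask):
        # generated, scribble: [1, 3, H, W]; mask: [1, 1, H, W], 1 where the user drew, 0 elsewhere.
        # Only the drawn pixels constrain the output; normalizing by the mask area keeps
        # sparse and dense scribbles on a comparable loss scale.
        diff = torch.abs(generated - scribble) * mask
        return diff.sum() / (mask.sum() * generated.shape[1] + 1e-8)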

[Figure: each row shows a scribble, its mask, and the StyleGAN w+ output.]

When using StyleGAN, sparse sketches yield poor results due to insufficient details for optimizing the latent code. However, denser sketches allow StyleGAN to generate more realistic images in the extended latent space w+, where varying w vectors across layers improves input reconstruction.

Part 3: Stable Diffusion

Here are some visualizations of the Stable Diffusion process, where the text prompt is "Grumpy cat reimagined as a royal painting".

[Figure: input images and their Stable Diffusion outputs.]
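
The forward-noise-then-denoise procedure can be sketched with the diffusers img2img pipeline; the model id, file names, and parameter values below are illustrative assumptions, not necessarily what was used for the results above:

    import torch
    from diffusers import StableDiffusionImg2ImgPipeline
    from PIL import Image

    # Load a pre-trained Stable Diffusion checkpoint (the model id is illustrative).
    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    init_image = Image.open("grumpy_cat.png").convert("RGB").resize((512, 512))  # hypothetical input file
    result = pipe(
        prompt="Grumpy cat reimagined as a royal painting",
        image=init_image,
        strength=0.6,         # how far into the forward (noising) process the input is pushed
        guidance_scale=10.0,  # classifier-free guidance strength
    ).images[0]
    result.save("grumpy_cat_royal.png")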

Effect of guidance strength

I investigated the effect of adjusting the guidance strength. The results reveal that as the guidance strength increases, the output adheres more closely to the input prompt, but it also becomes less similar to the input image.
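
Sweeping this parameter might look like the following, reusing the pipeline sketched above (the scale values mirror the columns in the figure below):

    # Reusing `pipe` and `init_image` from the sketch in the previous section.
    prompt = "Grumpy cat reimagined as a royal painting"
    for scale in [5, 10, 15, 20, 50]:
        out = pipe(prompt=prompt, image=init_image, strength=0.6, guidance_scale=scale).images[0]
        out.save(f"guidance_{scale}.png")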

[Figure: input image and outputs with guidance strength 5, 10, 15, 20, and 50.]

Effect of the number of diffusion steps

I also explored the impact of varying the number of diffusion steps. As illustrated below, the results align increasingly with the text prompt as the number of steps increases, but they become less similar to the input image.
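
The trend follows from the forward process: the later the timestep at which the image is noised, the less of the original signal survives before denoising begins. A small illustration with a diffusers DDPM scheduler (shapes and values are illustrative, and the mapping to the exact knob used above is an assumption):

    import torch
    from diffusers import DDPMScheduler

    scheduler = DDPMScheduler(num_train_timesteps=1000)
    latent = torch.randn(1, 4, 64, 64)        # stand-in for the VAE-encoded input image
    noise = torch.randn_like(latent)

    for t in [500, 700, 999]:                 # 999 is the last valid index for a 1000-step schedule
        noised = scheduler.add_noise(latent, noise, torch.tensor([t]))
        # alphas_cumprod[t] is the fraction of the original signal's variance that remains.
        print(t, float(scheduler.alphas_cumprod[t]))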

[Figure: input image and outputs with 500, 700, and 1000 diffusion steps.]