Assignment #5 - Cats Photo Editing

Jialu Gao (jialug)
2024 Spring, 16-726 Learning-based Image Synthesis

Description

In this project, we will implement various image editing techniques on natural images. For the first part, we will invert the GAN generator to reconstruct any given real image. Then we will perform constrained image generation where, given a scribble image, the GAN generator fills in the details. Finally, we will implement constrained image generation with Latent Diffusion Models, conditioned on a text prompt and an input image.

 

For Bells & Whistles, I will first show results of interpolating in the GAN latent space. Next, I will show results from pre-trained image editing methods and compare them with the results of our implementation.

Part 1: Inverting the Generator [30 points]

Implementation

In this part, we will try to reconstruct an image with some specific latent embedding. We will experiment with various loss functions, different generator architectures, and different sampling strategies in the latent space.
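Concretely, the inversion is a direct optimization over the latent code: we freeze the pre-trained generator, initialize a latent vector, and minimize a weighted reconstruction loss between the generated image and the target. Below is a minimal sketch of this loop; the names (generator, perceptual_loss) and the choice of the Adam optimizer are illustrative assumptions rather than the exact assignment interface.

    import torch
    import torch.nn.functional as F

    def invert(generator, target, latent_init, l1_w=5.0, l2_w=5.0, perc_w=0.1,
               perceptual_loss=None, steps=1000, lr=0.01):
        """Optimize a latent code so that generator(latent) reconstructs `target`."""
        latent = latent_init.clone().requires_grad_(True)   # a z, w, or w+ code
        optimizer = torch.optim.Adam([latent], lr=lr)        # assumption: Adam; LBFGS also works

        for _ in range(steps):
            optimizer.zero_grad()
            recon = generator(latent)                         # frozen, pre-trained generator
            loss = l1_w * F.l1_loss(recon, target) + l2_w * F.mse_loss(recon, target)
            if perceptual_loss is not None:                   # e.g. a VGG-based content loss
                loss = loss + perc_w * perceptual_loss(recon, target)
            loss.backward()
            optimizer.step()
        return latent.detach()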

 

For the following experiments, we will try to reconstruct the following image:
[image: goal image]

Results

1. Combination of different losses

We will first experiment with different loss functions. For this set of experiments, I use the w+ latent space and StyleGAN as the generator. The results of different combinations of the perceptual loss weight and the L1/L2 pixel losses are shown in the following table. All results are obtained after 1000 optimization steps.
               prec_wgt=0.01   prec_wgt=0.05   prec_wgt=0.1   prec_wgt=0.5
l1=10, l2=0    [image]         [image]         [image]        [image]
l1=0, l2=10    [image]         [image]         [image]        [image]
l1=5, l2=5     [image]         [image]         [image]        [image]

 

From the results we can observe that, for complex images with many details, L1 loss is not the best choice because it only tries to minimize the average absolute difference and misses structural information such as the flower in the background. However, L2 loss produces blurry results with edges smoothed out. Therefore, I choose to combine the L1 and L2 losses for the best result.

 

For the perceptual loss, if the weight is too small, it has little effect; if the weight is too large, it leads to blurry images. Therefore, I choose a perceptual loss weight of 0.1.
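For reference, the perceptual term compares images in a deep feature space rather than pixel space. The sketch below shows one common way to build such a loss from VGG-19 features; the layer cut-off (relu2_2) and the MSE in feature space are assumed choices and may differ from the assignment code.

    import torch
    import torchvision

    class PerceptualLoss(torch.nn.Module):
        """MSE between VGG-19 features of two images (layer choice is an assumed example)."""
        def __init__(self, layer_idx=8):                      # index 8 corresponds to relu2_2
            super().__init__()
            vgg = torchvision.models.vgg19(weights="IMAGENET1K_V1")
            self.features = vgg.features[:layer_idx + 1].eval()
            for p in self.features.parameters():
                p.requires_grad_(False)                       # keep the feature extractor frozen

        def forward(self, x, y):
            return torch.nn.functional.mse_loss(self.features(x), self.features(y))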

 

The runtime is similar across combinations, usually taking 20-21 seconds to finish the optimization and produce the final output.

2. Different Latent Space Sampling

In this part, we experiment with the effect of the different latent spaces z, w, and w+ for the StyleGAN generator. For the following experiments, I use a perceptual loss weight of 0.1 and both the L1 and L2 losses for optimization. The following results are achieved after 1000 steps.
Target Image   StyleGAN with z   StyleGAN with w   StyleGAN with w+
[image]        [image]           [image]           [image]

 

From the table, we can observe that with the latent z, the reconstruction does not have enough detail. With w and w+, the reconstructed image has a structure similar to the original, but with w+ the reconstruction contains more detail and higher color similarity.
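The three spaces differ only in where the initial code comes from: z is drawn from a Gaussian, w is the output of StyleGAN's mapping network, and w+ keeps a separate w per synthesis layer. A rough sketch is shown below; the mapping and num_ws attributes are assumptions about the generator wrapper, not the exact assignment API.

    import torch

    def sample_latent(generator, space, batch=1, z_dim=512, device="cuda"):
        """Return an initial latent code in the requested space: 'z', 'w', or 'w+'."""
        z = torch.randn(batch, z_dim, device=device)
        if space == "z":
            return z
        w = generator.mapping(z)                       # mapping network: z -> w
        if space == "w":
            return w
        # w+: one w per synthesis layer, initialized by broadcasting the single w
        return w.unsqueeze(1).repeat(1, generator.num_ws, 1)

In practice, averaging many sampled w vectors (the mean latent) tends to give a more stable starting point than a single random draw.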

 

The runtime is similar for each latent space, around 20 seconds for 1000 steps.

3. Different Generator Models

In this part, we test the effect of using different generator models. I set the perceptual loss weight to 0.1, use a combination of L1 and L2 losses, and use z as the latent space. The following results are achieved after 1000 steps.
Target Image   Vanilla GAN   StyleGAN
[image]        [image]       [image]

 

We can observe that StyleGAN leads to better results because it has a more disentangled latent space.

 

However, the vanilla architecture is faster to run, taking only 7 seconds. In contrast, StyleGAN takes around 20-21 seconds.

4. The Best Results

Overall, using the StyleGAN generator with the w+ latent space, a perceptual loss weight of 0.1, and L1 and L2 loss weights of 5 each leads to the best result.

Part 2: Scribble to Image [40 Points]

Implementation

For this part, we implement scribble-to-image generation by optimizing the latent embedding with the scribble input as a constraint. The constraint is added through a mask: pixels of the generated image inside the mask are pushed to match the scribble, while the remaining area is left free for the generator to fill in.
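Concretely, the only change from Part 1 is that every loss term is evaluated on the masked region against the scribble, so the optimizer is free to hallucinate the rest of the image. A minimal sketch, reusing the placeholder names from Part 1:

    import torch.nn.functional as F

    def masked_loss(recon, scribble, mask, l1_w=5.0, l2_w=5.0, perc_w=0.1,
                    perceptual_loss=None):
        """Penalize differences only inside the scribble mask; outside it is unconstrained."""
        recon_m, scribble_m = recon * mask, scribble * mask
        loss = l1_w * F.l1_loss(recon_m, scribble_m) + l2_w * F.mse_loss(recon_m, scribble_m)
        if perceptual_loss is not None:
            loss = loss + perc_w * perceptual_loss(recon_m, scribble_m)
        return loss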

Results

For the generation setup, I use the StyleGAN model with the w+ latent space, a combination of L1 and L2 losses, and a perceptual loss weight of 0.1. All results are generated after 1000 optimization steps.

 

Sketch    Mask      Output
[image]   [image]   [image]
[image]   [image]   [image]
[image]   [image]   [image]
[image]   [image]   [image]
[image]   [image]   [image]
[image]   [image]   [image]
[image]   [image]   [image]
[image]   [image]   [image]

 

We can observe that sketches with more detailed lines and shapes tend to lead to worse generated images. This is because we constrain the generated image to have the same pixel values as the sketch within the masked area, and sketches with more details are harder for the generator to reconcile with the remaining parts it has to fill in. Also, when the colors of the sketch are similar to those of real cat images, such as brown and white, the generated image looks more realistic than with sketches that contain blue or pink.

Part 3: Stable Diffusion [30 points]

In this part, we implement image generation conditioned on a text prompt and an input image with latent diffusion models using DDPM sampling. We first encode the input image into the latent space and add noise to it, then we denoise it with the reverse diffusion process, using classifier-free guidance to inject the text condition into the generation process.
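Schematically: encode the image with the VAE, jump to an intermediate noise level, then run the reverse DDPM loop where each noise prediction is a classifier-free-guidance combination of a conditional and an unconditional U-Net pass. The sketch below assumes diffusers-style components (vae, unet, scheduler, precomputed text_emb / uncond_emb); the names and the 0.18215 latent scaling are assumptions, not the exact assignment code.

    import torch

    @torch.no_grad()
    def sdedit(vae, unet, scheduler, image, text_emb, uncond_emb,
               start_t=500, guidance=7.5):
        """Noise an encoded image up to step `start_t`, then denoise it with CFG."""
        latents = vae.encode(image).latent_dist.sample() * 0.18215   # SD latent scaling
        noise = torch.randn_like(latents)
        t = torch.tensor([start_t], device=latents.device)
        latents = scheduler.add_noise(latents, noise, t)              # forward (noising) jump

        for step in scheduler.timesteps[scheduler.timesteps <= start_t]:
            inp = torch.cat([latents, latents])                       # one batch for cond + uncond
            emb = torch.cat([text_emb, uncond_emb])
            noise_pred = unet(inp, step, encoder_hidden_states=emb).sample
            cond, uncond = noise_pred.chunk(2)
            guided = uncond + guidance * (cond - uncond)              # classifier-free guidance
            latents = scheduler.step(guided, step, latents).prev_sample

        return vae.decode(latents / 0.18215).sample

The two knobs explored below map directly to start_t (how much noise is added) and guidance (how strongly the text condition is enforced).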

 

Results

Here I experiment with two input images of my own choice and try to edit them into a royal painting style.

 

Input (prompt)                                        t=500, strength=7.5   t=750, strength=7.5   t=500, strength=15   t=750, strength=15
[image] "Grumpy cat reimagined as a royal painting"   [image]               [image]               [image]              [image]
[image] "Astronaut riding horse as a royal painting"  [image]               [image]               [image]              [image]

 

We can conclude from the table that with a larger starting timestep (more noise added), the output diverges more from the input image, and with a larger guidance strength, the output aligns more closely with the text prompt.

 

Bells & Whistles

1. Interpolate between two latent codes in the GAN model

Here I use linear interpolation between two optimized latent embeddings obtained from two separate input cat images. The latent embeddings are optimized with the StyleGAN generator in the w+ latent space, using a perceptual loss weight of 0.1 and a combination of the L1 and L2 losses as the objective.
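The interpolation itself is a linear blend between the two optimized w+ codes, decoded at each blend weight. A small sketch, with the same assumed generator interface as before:

    import torch

    def interpolate(generator, latent_a, latent_b, num_frames=8):
        """Decode a sequence of linear blends between two optimized latent codes."""
        frames = []
        for alpha in torch.linspace(0.0, 1.0, num_frames):
            latent = (1 - alpha) * latent_a + alpha * latent_b
            frames.append(generator(latent))
        return frames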

 

Source Image   Interpolation   Target Image
[image]        [image]         [image]
[image]        [image]         [image]

 

 

2. Comparison with Other Image Editing Methods

In recent years, many generative models have been trained on huge amounts of data and can generate stylized images from text prompts. Built on top of these models, there are also image editing methods that condition the generated image on an input image and a text prompt.

 

InstructPix2Pix is one of the most recent works that can perform image editing in an end-to-end fashion without any fine-tuning. Therefore, I run the InstructPix2Pix model on our image editing task and compare the results.

 

For the InstructPix2Pix prompt, I use "Make it a royal painting.", as suggested by the InstructPix2Pix paper.
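A comparable run can be reproduced with the publicly available Hugging Face diffusers pipeline for InstructPix2Pix; the snippet below shows typical usage, with the file names and parameter values being illustrative assumptions rather than the exact settings used here.

    import torch
    from PIL import Image
    from diffusers import StableDiffusionInstructPix2PixPipeline

    pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
        "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
    ).to("cuda")

    image = Image.open("grumpy_cat.png").convert("RGB")     # hypothetical input path
    edited = pipe("Make it a royal painting.", image=image,
                  num_inference_steps=20, image_guidance_scale=1.5).images[0]
    edited.save("grumpy_cat_royal.png")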

 

Input     InstructPix2Pix Output             Our Output
[image]   [image] (run time: 15 seconds)     [image] (run time: 124 seconds)
[image]   [image] (run time: 15 seconds)     [image] (run time: 124 seconds)

 

From the results, we can observe that InstructPix2Pix stays fairly close to the input images in terms of structure, but does not really bring out the "royal painting" style change. This could be because the InstructPix2Pix model is not familiar with the concept of a "royal painting", or because better prompts are needed to bring out the intended change. Since the InstructPix2Pix model is pre-trained on a different dataset, we do not know whether training samples with "royal painting" edits exist in its training set.

 

For computation statistics, the inference time for InstructPix2Pix on an A4500 is 15-16 seconds on average, which is much faster than the runtime of our Stable Diffusion edits. One reason could be that the implementation differs between our homework and the Hugging Face stable-diffusion codebase, and another could be the difference in sampling schemes, since DDPM uses more sampling steps than DDIM.

 

Overall, end-to-end models are easier to use, but they are used as black boxes and are therefore harder to control in a fine-grained way. Moreover, these models can carry unknown biases from their training data, and when there is a gap between the inference domain and the training distribution, they are more likely to fail. On the other hand, methods like SDEdit can exert control based on individual inputs and allow users to adjust the level of constraint to their preference, but users have to perform some prompt tuning and hyper-parameter tuning to get the best results.