**16-726 Cats Photo Editing**
**Anish Jain (anishaja)**
Overview
======================================================================
In this project, I implemented a few techniques to manipulate images on the manifold of natural images.
Inverting the Generator
======================================================================
In this section, I invert a pre-trained generator to find a latent variable that closely reconstructs a given real image.
Ablation Study: Losses
---------------------------------------------------------------------
I used the following combined loss to optimize the latent:
$L(G(z), x) = \lambda_{1} \cdot L_{1}(G(z), x) + \lambda_{2} \cdot L_{2}(G(z), x) + \lambda_{3} \cdot L_{perceptual}(G(z), x)$.
The perceptual loss measures the content distance between two images at a particular layer. For this assignment, the conv_5 layer of a pre-trained VGG-19 network is used to compute the distance between the features extracted from the input and target images.
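As a rough sketch of how this combined objective can be written in PyTorch (not the exact assignment code; the VGG-19 slice index for conv_5 and the default weights below are assumptions):

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Frozen VGG-19 feature extractor. The slice is assumed to end at the conv_5
# layer (counting conv layers sequentially); adjust to the layer actually
# used. Move `vgg` to the same device as the images and normalize inputs for
# VGG before calling it.
vgg = models.vgg19(pretrained=True).features[:11].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def perceptual_loss(gen_img, target):
    """Content distance between VGG-19 features of the two images."""
    return F.mse_loss(vgg(gen_img), vgg(target))

def reconstruction_loss(gen_img, target, lam1=10.0, lam2=0.0, lam3=0.01):
    """L(G(z), x) = lam1 * L1 + lam2 * L2 + lam3 * L_perceptual."""
    return (lam1 * F.l1_loss(gen_img, target)
            + lam2 * F.mse_loss(gen_img, target)
            + lam3 * perceptual_loss(gen_img, target))
```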
I experimented with different values of $\lambda_{1}$, $\lambda_{2}$ and $\lambda_{3}$ to see how they affect the quality of the reconstructed image.
First I fixed $\lambda_{1} = 10$ and $\lambda_{2} = 0$ and varied $\lambda_{3}$.
| Target | $\lambda_3 = 0.1$ | $\lambda_3 = 0.01$ | $\lambda_3 = 0.001$ |
|----------------------------------------------------------------|-------------------------------------------------------------------------------|---------------------------------------------------------------------------------|-----------------------------------------------------------------------------------|
|  |  |  |  |
Next I fixed $\lambda_{1} = 0$ and $\lambda_{2} = 10$ and varied $\lambda_{3}$.
| Target | $\lambda_3 = 0.1$ | $\lambda_3 = 0.01$ | $\lambda_3 = 0.001$ |
|----------------------------------------------------------------|-------------------------------------------------------------------------------|---------------------------------------------------------------------------------|-----------------------------------------------------------------------------------|
|  |  |  |  |
For the above results, I used StyleGAN with the w+ latent space.
From the above results, it can be seen that $\lambda_{1}=10, \lambda_{2}=0, \lambda_{3}=0.01$ gives the best reconstruction.
Ablation Study: Generative models
---------------------------------------------------------------------
I experimented with different generative models to see how they affect the quality of the reconstructed image, using the best hyperparameters from above and the latent space z.
| Target | Vanilla GAN | StyleGAN |
|------------------------------------------------|---------------------------------------------------------------|----------------------------------------------------------------|
|  |  |  |
|  |  |  |
The results from StyleGAN are better than those from the vanilla GAN.
Ablation Study: Latent space
---------------------------------------------------------------------
I experimented with different latent spaces to see how they affect the quality of the reconstructed image, using the best hyperparameters and StyleGAN.
| Target | z | w | w+ |
|------------------------------------------------|----------------------------------------------------------------|----------------------------------------------------------------|-----------------------------------------------------------------|
|  |  |  |  |
|  |  |  |  |
The results from w+ are better than those from z and w. In the w+ space, the colors (like the purple in the upper-left corner) are much closer to the original, and the features are sharper (especially in the eyes).
The method runs 1000 iterations in under 30 seconds on a GPU.
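For reference, a minimal sketch of the inversion loop described above (names and defaults are illustrative; the `reconstruction_loss` from the earlier sketch can be passed as `loss_fn`, and for w/w+ the latent would come from the mapping network, with one 512-d code per synthesis layer in w+):

```python
import torch
import torch.nn.functional as F

def invert(generator, target, loss_fn=F.l1_loss,
           latent_dim=512, n_iters=1000, lr=0.01):
    """Optimize a latent so that generator(latent) reconstructs `target`.

    `generator` is assumed to map a latent tensor to an image in the same
    range as `target`. The z space case is shown: a single 512-d vector.
    """
    latent = torch.randn(1, latent_dim, device=target.device,
                         requires_grad=True)
    opt = torch.optim.Adam([latent], lr=lr)
    for _ in range(n_iters):
        opt.zero_grad()
        img = generator(latent)
        loss = loss_fn(img, target)
        loss.backward()
        opt.step()
    return latent.detach()
```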
Scribble to Image
======================================================================
| Sketch | Mask | Image |
|---------------------------------------------|---------------------------------------------|---------------------------------------------------------|
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
For most of the results, the model can generate images with decent detail based on the mask. Sparser masks give better results: with a denser mask, more pixels are constrained, so the optimization forces the resulting image to look too much like the sketch.
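A minimal sketch of how such a masked constraint can be expressed (illustrative, reusing the inversion loop above; `perc_loss` can be the perceptual-loss sketch from earlier): only the pixels covered by the mask are pushed toward the scribble colors.

```python
import torch.nn.functional as F

def scribble_loss(gen_img, sketch, mask, perc_loss=None, lam_perc=0.01):
    """Penalize mismatch only where the mask is non-zero; unmasked regions
    stay free so the optimizer can keep the image on the natural-image
    manifold."""
    loss = F.l1_loss(gen_img * mask, sketch * mask)
    if perc_loss is not None:
        loss = loss + lam_perc * perc_loss(gen_img * mask, sketch * mask)
    return loss
```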
Stable Diffusion
======================================================================
In this section, I take a hand-drawn sketch and use Stable Diffusion to generate an image that matches both the sketch and a text prompt.
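The snippet below is not the course implementation; it is a rough sketch of the same idea using the Hugging Face diffusers img2img pipeline, where `strength` plays a role similar to the amount of injected noise and `guidance_scale` is the classifier-free guidance strength (the model id and parameter values are illustrative).

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Illustrative checkpoint; any Stable Diffusion model compatible with the
# img2img pipeline would work.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

sketch = Image.open("sketch.png").convert("RGB").resize((512, 512))
result = pipe(
    prompt="Grumpy cat reimagined as a royal painting",
    image=sketch,
    strength=0.7,        # how much noise replaces the sketch before denoising
    guidance_scale=7.5,  # classifier-free guidance strength
).images[0]
result.save("output.png")
```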
Varying Noise
---------------------------------------------------------------------
| Prompt | Sketch | Noise Std Dev 0.5 | Noise Std Dev 1 |
|---------------------------------------------|-------------------------------------------------------------|----------------------------------------------------------------|--------------------------------------------------------------|
| Grumpy cat reimagined as a royal painting |  |  |  |
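Conceptually, the noise level controls how much of the sketch survives: the encoded sketch is mixed with Gaussian noise of the chosen standard deviation before denoising. A hedged sketch (names are illustrative):

```python
import torch

def perturb(sketch_latent: torch.Tensor, noise_std: float) -> torch.Tensor:
    """Starting point for the reverse diffusion process: the encoded sketch
    plus Gaussian noise. A larger noise_std (e.g. 1 vs 0.5) erases more of
    the drawing, so the result follows the prompt more and the sketch less."""
    return sketch_latent + noise_std * torch.randn_like(sketch_latent)
```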
Varying Classifier-Free Guidance Strength
---------------------------------------------------------------------
| Prompt | Sketch | Strength 5 | Strength 40 |
|---------------------------------------------|---------------------------------------------------------------|-----------------------------------------------------------------|------------------------------------------------------------------|
| A fantasy landscape, trending on artstation |  |  |  |
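Classifier-free guidance blends the unconditional and prompt-conditioned noise predictions at every denoising step; the strength varied above is the weight `w` below (a generic sketch, not the exact course code):

```python
import torch

def cfg(eps_uncond: torch.Tensor, eps_cond: torch.Tensor, w: float) -> torch.Tensor:
    """Classifier-free guidance: w = 1 keeps the plain conditional prediction,
    while a large w (e.g. 40) follows the prompt aggressively, often at the
    cost of fidelity to the sketch and of image naturalness."""
    return eps_uncond + w * (eps_cond - eps_uncond)
```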
Some more results
---------------------------------------------------------------------
In all honesty, my sketching skills are not so bad. But I wanted to see how the model performs on the poor sketches below :)
| Prompt | Sketch | Output |
|-------------------------------------------------------------------------|-------------------------------------------------------------|------------------------------------------------------------|
| Realistic photo of red car driving on a road with mountains in backdrop |  |  |
| Artistic photo of a cat wearing a magician's red hat |  |  |
| A high quality photo of harry potter casting a spell |  |  |
Bells and Whistles
======================================================================
Interpolation
---------------------------------------------------------------------
In this section, I interpolate between two images in the latent space of the generator.
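A minimal sketch of the interpolation, assuming two latents (e.g. w+ codes) obtained with the inversion loop above:

```python
import torch

def interpolate(generator, w1, w2, n_frames=30):
    """Decode images along the straight line between two inverted latents."""
    frames = []
    with torch.no_grad():
        for t in torch.linspace(0.0, 1.0, n_frames):
            frames.append(generator((1 - t) * w1 + t * w2))
    return frames
```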
High-res Grumpy Cats
---------------------------------------------------------------------
In this section, the pre-trained stylegan256 model and the 256x256 cat images are used. The w+ latent space along with the L2 loss is used to generate the images.
| Original Image | Generated Image |
|------------------------------------------------------|-----------------------------------------------------------------------|
|  |  |
|  |  |
|  |  |