**16-726 Assignment 5: GAN Photo Editing**

Emma Liu (emmaliu)

(#) Overview

The objective of this assignment is to implement several image manipulation techniques that use GANs to edit images while keeping the results on the manifold of real images. We do this by inverting a pre-trained generator to find a latent variable that closely reconstructs a given real image; the same machinery then lets us generate images that fit hand-drawn sketches.

(#) Inverting the Generator

To invert a pre-trained generator, we solve a non-convex optimization problem that reconstructs an image from a latent vector. The goal is to keep the output of the trained generator as close to the natural image manifold as possible, recalling that natural images lie on a low-dimensional manifold. Expressed formally, given a real image $x$, generator $G$, and loss function $L$, our objective is to find a latent vector $z_*$ such that:
$$z_* = \arg\min_z \; L(G(z), x)$$
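To make the procedure concrete, here is a minimal sketch of how this projection might be run with L-BFGS in PyTorch. The generator `G`, the perceptual-loss callable `perc_loss_fn`, and the default weights are placeholders standing in for the assignment's own modules; the actual loss combination we use is described next.

```python
import torch
import torch.nn.functional as F

def project(G, x_real, z_init, perc_loss_fn, w_perc=0.6, w_pixel=0.4, n_steps=1000):
    """Invert generator G: find the latent z minimizing L(G(z), x_real).

    G and perc_loss_fn are placeholders for the assignment's generator and
    multi-layer perceptual loss; only the optimization structure is shown.
    """
    z = z_init.clone().detach().requires_grad_(True)
    optimizer = torch.optim.LBFGS([z], max_iter=n_steps)

    def closure():
        optimizer.zero_grad()
        x_fake = G(z)
        # weighted sum of the multi-layer perceptual loss and a raw-pixel L2 loss
        loss = w_perc * perc_loss_fn(x_fake, x_real) + w_pixel * F.mse_loss(x_fake, x_real)
        loss.backward()
        return loss

    optimizer.step(closure)  # L-BFGS re-evaluates the closure internally
    return z.detach()
```

The closure pattern is required because `torch.optim.LBFGS` needs to re-evaluate the loss several times per step.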
The loss function we use is a combination of perceptual loss (implemented as the same mean-squared-error content loss as in the previous neural style transfer assignment) and pixel loss (L2 loss on raw pixels). As before, the perceptual loss is computed at multiple layers of the network, and the total perceptual loss is the sum of these per-layer losses. Since we have access to gradients, this optimization problem can be solved with L-BFGS (which is what we use) or other first-order or quasi-Newton methods.

(##) Effect of Different Losses

(###) Weighing Pixel and Perceptual Loss

We can perform an ablation study on the effect of changing the pixel and perceptual loss weights. These results are from StyleGAN run in the $w$ latent space.
Original
$w_{perc}=0$
$w_{perc}=0.2$
$w_{perc}=0.4$
$w_{perc}=0.6$
$w_{perc}=0.8$
$w_{perc}=1$
Personally, I think a perceptual loss weight of 0.6 and a pixel loss weight of 0.4 yielded the best-looking results (i.e., weighting the perceptual content loss, and thus the content/features in the image, slightly more heavily). This setting best preserved detailed features and overall style, including the flower, the eyes, the facial fur, and the green background. In terms of runtime, there is no significant difference between these weight choices.

(##) Effect of Model Type

Holding the perceptual and pixel loss weights and the latent space $z$ constant, we can compare the results of using VanillaGAN versus StyleGAN.
Original
VanillaGAN
250 iters
2750 iters
5250 iters
10000 iters
StyleGAN
250 iters
2750 iters
5250 iters
10000 iters
We can see that StyleGAN converges to a decent result sooner than VanillaGAN, with little difference between intermediate iterations, and produces better results in the end. This probably comes down to differences in the underlying architectures of the two generators. In terms of runtime, however, VanillaGAN runs significantly faster than StyleGAN (about 16 seconds for 1,000 iterations, as opposed to about 36 seconds for StyleGAN).

(##) Effect of Latent Space

Holding the perceptual and pixel loss weights fixed at 0.6 and 0.4 respectively, and using StyleGAN for all trials, we can test the difference between the choice of latent space: $z$, $w$, or $w+$ (the three spaces are sketched in code below; results follow).
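To make the distinction concrete, the sketch below shows roughly how a latent would be drawn in each space for a StyleGAN-like generator; the attributes `G.mapping` and `G.num_ws` are assumptions about the model wrapper, not the assignment's exact API.

```python
import torch

def sample_latents(G, latent_dim=512):
    """Illustrates the z / w / w+ latent spaces for a StyleGAN-like generator G.

    `G.mapping` and `G.num_ws` are assumed attributes, used only for illustration.
    """
    z = torch.randn(1, latent_dim)                    # z: raw Gaussian noise
    w = G.mapping(z)                                  # w: z pushed through the mapping network
    w_plus = w.unsqueeze(1).repeat(1, G.num_ws, 1)    # w+: an independent copy of w per synthesis layer
    return z, w, w_plus
```

During projection, whichever latent we choose becomes the variable that L-BFGS optimizes; $w+$ simply gives the optimizer a separate style vector for every synthesis layer.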
Original
z - 5000 iters
z - 10000 iters
w - 5000 iters
w - 10000 iters
w+ - 5000 iters
w+ - 10000 iters
As expected, $z$ generally performs the worst of the three, since with $z$ we never map the latent (random noise, or a zero vector when we take the mean) into the style space, which encodes information about the style of the image. Between $w$ and $w+$, $w+$ generally performs better at maintaining feature similarity. This is also to be expected, since $w$ is a single style vector while $w+$ provides a separate style vector for each synthesis layer, giving the optimization more degrees of freedom. The difference in execution time was not very noticeable. Based on these experiments, I generally used StyleGAN with the $w+$ latent space for the rest of the assignment, usually with weights of 0.6 and 0.4 for perceptual and pixel loss respectively, though at times I fall back to the default weight settings when the perceptual loss dominates too strongly, or for comparison.

(#) Interpolating between Cats

The next part of the assignment interpolates generated images by combining their inverses (the latent vectors used to generate them) via linear interpolation over a discretization of $\theta \in [0,1]$. More formally, given images $x_1$ and $x_2$, we first find $z_1 = G^{-1}(x_1)$ and $z_2 = G^{-1}(x_2)$, then linearly interpolate them as $z' = \theta z_1 + (1-\theta) z_2$. Using this new $z'$, we generate an intermediate GIF frame as $x' = G(z')$. The results are as follows. For all settings, we achieve a smooth transition from a given source image (the odd-indexed images from the loader; in my implementation, the weight given to $z_1$ starts low and increases to 1) to a given target image (the preceding even-indexed images from the loader). The resulting interpolations are shown after the sketch below, with the original images on opposite ends, using each of the $z$, $w$, and $w+$ latent spaces and the same perceptual loss weight of 0.6 and pixel loss weight of 0.4 throughout.
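A minimal sketch of the interpolation loop (the generator `G` and the already-projected latents are placeholders; saving the frames to a GIF is omitted):

```python
import torch

def interpolate_latents(G, z1, z2, n_frames=30):
    """Generate intermediate frames between two projected latents z1 and z2."""
    frames = []
    for theta in torch.linspace(0.0, 1.0, n_frames):
        z_prime = theta * z1 + (1.0 - theta) * z2    # z' = theta*z1 + (1-theta)*z2
        with torch.no_grad():
            frames.append(G(z_prime))                # decode the interpolated latent
    return frames
```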
Original 1
z
w
w+
Original 0
Original 3
z
w
w+
Original 2
Original 5
z
w
w+
Original 4
Original 7
z
w
w+
Original 6
We can see that the interpolated images between the source and target proceed fairly smoothly. Subjectively, I think $w+$ gives the best-looking results, smoothly translating the color and detail from source to target without strange warping effects (which are noticeable in the $z$-space interpolation between original images 1 and 0). It also looks like $w$ and $w+$ beat out $z$ in capturing color and texture changes in the intermediate frames. Overall, these results more or less align with my observations from the ablation trials on the projected images.

(#) Scribble to Image

Now we can use what we've learned so far to constrain generated images to look like scribbled sketches while maintaining realism. To start, we need a way to generate an image subject to some constraints. We again solve a non-convex optimization problem to produce realistic-looking cats, but this time we also apply a foreground mask derived from the sketch so that the generated image tries to match the structure, color, and detail of the sketch. This is done with the Hadamard product (element-wise multiplication) applied to both the generated image and the sketch, and we minimize the difference between the two masked images:
$$z_* = \arg\min_z \| M \odot G(z) - M \odot S \|_1$$
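A minimal sketch of this masked term (names are placeholders; in practice it is combined with the perceptual and pixel losses from the projection step):

```python
import torch

def masked_l1_loss(x_fake, sketch, mask):
    """||M * G(z) - M * S||_1, restricted to the sketch's foreground mask.

    `mask` is a binary mask broadcastable to the image shape; taking the mean
    (rather than the sum) is an implementation convenience, not fixed by the objective.
    """
    return torch.abs(mask * x_fake - mask * sketch).mean()
```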
Here are some results made with the provided sketches, using StyleGAN with the $w+$ latent space and 1000 iterations, comparing equal perceptual/pixel loss weights (0.5/0.5) against the defaults (0.01 and 10 respectively).
Sketch 1
Mask 1
$w_{perc}=0.5, w_{pixel}=0.5$
$w_{perc}=0.01, w_{pixel}=10$
Sketch 2
Mask 2
$w_{perc}=0.5, w_{pixel}=0.5$
$w_{perc}=0.01, w_{pixel}=10$
Sketch 3
Mask 3
$w_{perc}=0.5, w_{pixel}=0.5$
$w_{perc}=0.01, w_{pixel}=10$
Sketch 4
Mask 4
$w_{perc}=0.5, w_{pixel}=0.5$
$w_{perc}=0.01, w_{pixel}=10$
Sketch 5
Mask 5
$w_{perc}=0.5, w_{pixel}=0.5$
$w_{perc}=0.01, w_{pixel}=10$
Sketch 6
Mask 6
$w_{perc}=0.5, w_{pixel}=0.5$
$w_{perc}=0.01, w_{pixel}=10$
Sketch 7
Mask 7
$w_{perc}=0.5, w_{pixel}=0.5$
$w_{perc}=0.01, w_{pixel}=10$
In terms of quality, we can see that lowering the weight on the perceptual loss makes the result adhere to the mask slightly less strictly, at least in terms of how the dead space of the mask is filled in (the unsketched regions are no longer just dark holes). Dense sketches with color schemes and features resembling realistic Grumpy cats (like sketches 3, 4, and 7, albeit roughly) transfer their color and features decently well, but texture over regions of constant color gets smoothed over. On the other hand, the features of sparse sketches sometimes transfer poorly and sometimes fine (compare sketches 1 and 6), but when they do transfer, they stand out noticeably against the more realistic-looking sections (fur, texture, etc.) that fill in the rest of the mask. When the color palette differs from the provided one, or from colors typically seen on realistic Grumpy cats, the results look cartoonish and unrealistic (sketch 1).

These trends are more noticeable in the results produced by my own sketches, each run for 2500 iterations using StyleGAN, the $w+$ latent space, and perceptual/pixel loss weights of 0.01 and 10 respectively (the defaults). For the dense sketches, I used solid filled shapes (e.g. ovals) to form the base of the face, and these solid shapes transfer well into homogeneously colored oval regions in the generated images. While the colors and features match relatively well in these homemade sketches, it is clear (especially from the denser sketches) that the texture of dense, mono-colored regions is not transferred as well.
Sparse, Wrong Color
Mask
Result (StyleGAN w+)
Sparse, Similar Color
Mask
Result (StyleGAN w+)
Dense, Wrong Color
Mask
Result (StyleGAN w+)
Dense, Similar Color
Mask
Result (StyleGAN w+)
(#) Bells & Whistles

(##) High-Resolution Cats

I experimented with high-resolution cats (128x128 and 256x256), using the high-resolution pretrained models and data provided. Not much changed in terms of code effort, other than accommodating the additional GAN options in the parser arguments (a sketch of this is shown below). The interpolation results with StyleGAN128 and StyleGAN256 follow the sketch, using the $w+$ latent space and perceptual/pixel loss weights of 0.6 and 0.4 respectively. We can see that the quality of the intermediate images really improves with high-resolution data and models!
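For reference, the change amounts to extending the model choices accepted on the command line; the flag and option names below are illustrative only, not the assignment's exact interface.

```python
import argparse

# Illustrative only: the real script's flag names and choices may differ.
parser = argparse.ArgumentParser()
parser.add_argument('--model', default='stylegan',
                    choices=['vanilla', 'stylegan', 'stylegan128', 'stylegan256'],
                    help='which pretrained generator (and resolution) to load')
parser.add_argument('--latent', default='w+', choices=['z', 'w', 'w+'])
args = parser.parse_args()
```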
Original 3
128x128
Original 2
Original 3
256x256
Original 2
Original 5
128x128
Original 4
Original 5
256x256
Original 4
Original 7
128x128
Original 6
Original 7
256x256
Original 6
As for scribble drawing, the results below use the default perceptual/pixel loss weights, the $w+$ latent space, and 2500 iterations. It looks like with a high-resolution dataset, such large perceptual and pixel loss weights are necessary, but only a few iterations need to be run: somewhat decent results and features are already captured at the first checkpoint (iteration 250), whereas by iteration 2500 the texture is mostly smoothed out.
Sparse, Wrong Color
Iter 250 (128x128)
Iter 2500 Final (128x128)
Sparse, Wrong Color
Iter 250 (256x256)
Iter 2500 Final (256x256)
Sparse, Correct Color
Iter 250 (128x128)
Iter 2500 Final (128x128)
Sparse, Correct Color
Iter 250 (256x256)
Iter 2500 Final (256x256)
Dense, Wrong Color
Iter 250 (128x128)
Iter 2500 Final (128x128)
Dense, Wrong Color
Iter 250 (256x256)
Iter 2500 Final (256x256)
Dense, Correct Color
Iter 250 (128x128)
Iter 2500 Final (128x128)
Dense, Correct Color
Iter 250 (256x256)
Iter 2500 Final (256x256)
(##) Texture Constraint Using Style Loss

Much in the same way that we add content loss after select layers ($conv_5$ was used throughout this assignment), we can add style loss as we did in HW4 for neural style transfer, in order to better capture the texture of the original images. After select layers of the pretrained VGG-19 network used for the perceptual losses, the style loss is computed as the MSE between the Gram matrix of the layer's features and that of the target's features. The Gram matrix captures the correlations between feature channels, so this measure compares texture statistics rather than exact spatial content. Since applying style loss at earlier layers produced better results in my implementation of HW4, I inserted style loss after layers $[conv_1, conv_2, conv_3, conv_4, conv_5]$ and kept the content loss at $conv_5$.

The projection results with both content and style loss are shown after the sketch below, using 128x128 high-resolution images and perceptual/pixel loss weights of 0.6 and 0.4, run for 2500 iterations. We can see that the transferred texture (finer details of hair and whiskers) is much better than before, across all latent spaces. Still, $w+$ seems to give the best results in terms of pose and lighting, especially in image 2.
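A brief sketch of the Gram-matrix style loss at a single layer (mirroring the HW4 formulation; the normalization factor is a common convention rather than something fixed by the assignment):

```python
import torch
import torch.nn.functional as F

def gram_matrix(features):
    """Channel-correlation (Gram) matrix of a conv feature map of shape (B, C, H, W)."""
    b, c, h, w = features.shape
    flat = features.view(b, c, h * w)
    gram = torch.bmm(flat, flat.transpose(1, 2))     # (B, C, C) channel correlations
    return gram / (c * h * w)

def style_loss(features, target_features):
    """MSE between the Gram matrices of generated and target features at one layer."""
    return F.mse_loss(gram_matrix(features), gram_matrix(target_features.detach()))
```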
Original 1
z
w
w+
Original 2
z
w
w+
Original 3
z
w
w+
Original 4
z
w
w+
Original 5
z
w
w+
We can further see how style loss performs in interpolation, compared to our previous 128x128 interpolation results. With style loss included, the texture appears slightly more consistent, and features (like the eyes) are less deformed in the intermediate frames.
Original 3
Content Loss Only
128x128
Content + Style Loss
Original 2
Original 5
Content Loss Only
Content + Style Loss
Original 4
Original 7
128x128
Content + Style Loss
Original 6
Finally, we can show how texture synthesis performs when we subject our generated images to scribble constraints as well. The sparse sketches with a color scheme consistent with real Grumpy cats worked surprisingly well here, and the dense sketches look decent at lower iteration counts. The default perceptual/pixel loss weights and the $w+$ latent space are used.
Sparse, Wrong Color
Iter 250 (128x128)
Iter 2500 Final (128x128)
Sparse, Correct Color
Iter 250 (128x128)
Iter 2500 Final (128x128)
Dense, Wrong Color
Iter 250 (128x128)
Iter 2500 Final (128x128)
Dense, Correct Color
Iter 250 (128x128)
Iter 2500 Final (128x128)
(#) Attributions

I collaborated with Jason Xu and Joyce Zhang on this project, to the extent of discussing higher-level details of the writeup. I used the [15-668: Physics-Based Rendering](http://graphics.cs.cmu.edu/courses/15-468/) report template. For an overview of Markdeep and its syntax, see the [official demo document](https://casual-effects.com/markdeep/features.md.html) and the associated [source code](https://casual-effects.com/markdeep/features.md.html?noformat).