Assignment #5 - GAN Photo Editing

Tarasha Khurana (Andrew ID: tkhurana)

Inverting the Generator

Choice of latent space

In the following visuals, we see that the quality of reconstructions with the Vanilla GAN using the latent space z is very limited. The StyleGAN architecture does a better job of reconstructing from the z latent space, and the quality improves further when the w or w+ latent space is used. Specifically, the w+ latent space gives the best reconstructions: the images are smooth in the continuous regions and sharp wherever the features of the cat appear, for example, the eyes and the outlines of the mouth and nose.
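The inversion itself can be sketched as a simple latent-code optimization. This is a minimal, hypothetical version assuming a differentiable generator `G` and a plain MSE objective; the actual assignment also uses a perceptual term, and the latent can be a z, w, or w+ code depending on the configuration.

```python
import torch

def invert(G, target, latent_dim=512, n_iters=1000, lr=0.01):
    # Optimize a latent code so that G(latent) matches the target image.
    # G is any differentiable generator mapping a latent vector to an
    # image; the loss here is plain MSE for illustration.
    latent = torch.randn(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([latent], lr=lr)
    for _ in range(n_iters):
        opt.zero_grad()
        loss = ((G(latent) - target) ** 2).mean()
        loss.backward()
        opt.step()
    return latent.detach()
```

Optimizing in w or w+ works the same way, except the code is fed past the mapping network directly into the synthesis network.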

[Figure: three target cats and their reconstructions with Vanilla GAN + z, StyleGAN + z, StyleGAN + w, and StyleGAN + w+]

Combination of losses

In the following visuals, we see that a combination of the L2 norm and the perceptual loss computed from the conv_1 layer of VGG-19 gives the best reconstruction. The quality peaks when the two are weighted appropriately, in this case with a weight of 1.0 on the perceptual loss. When only the MSE loss is used, the colors of the target image are not matched properly in the reconstruction, and when only the perceptual loss is used, the reconstruction is smoother than we would like, for example, in the hair on the cat's forehead. I used the best model from above (StyleGAN + w+) for these loss-combination experiments.

[Figure: target and reconstructions with MSE only, Perceptual only, MSE + 0.1 * P, and MSE + 1 * P]

In all the following experiments, the StyleGAN model is used with the w+ latent code and a combination of perceptual and MSE losses, unless otherwise specified, as this configuration performs best. Reconstructing each image takes less than 15 s, which is a reasonable overhead considering we run ~1000 iterations of optimization.

Interpolate your Cats

The following visuals show interpolation using StyleGAN with the w+ latent code. The intermediate frames make the transition from the start to the end image very smooth. Features such as the eyes and the color and shape of the mouth and nose change smoothly, so every intermediate frame is a plausible cat. The w+ space therefore encodes the cat images very well.
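The interpolation amounts to linearly blending two inverted codes and decoding each blend with the generator. A minimal sketch, assuming `w_start` and `w_end` are w+ codes recovered by the inversion above:

```python
import torch

def interpolate_latents(w_start, w_end, n_frames=8):
    # Linearly interpolate between two inverted w+ codes; decoding each
    # intermediate code with the StyleGAN synthesis network yields the
    # frames of the transition video.
    alphas = torch.linspace(0.0, 1.0, n_frames)
    return [(1 - a) * w_start + a * w_end for a in alphas]
```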

[Figure: two start-to-end interpolation sequences with StyleGAN and latent code w+]

Scribble to Image

In the following reconstructions constrained by the input sketches, we see that the outputs are fairly realistic. I tuned the weights and iteration counts more carefully for my own sketches and kept those hyperparameters fixed when generating images for the given scribbles. Hence, the outputs are more realistic for my sketches than for the given ones, though the latter could be tuned similarly.

In general, since a denser input sketch imposes constraints on more pixels, the network has less room to hallucinate, and the output for dense sketches looks almost like the sketch itself. For sparser sketches, the network hallucinates the cat face in the unconstrained regions, while the colors of, for example, the lips or eyes are constrained to match the input sketch. All in all, the algorithm works much as we would like it to!
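The sketch constraint can be expressed as a masked reconstruction loss: penalize the generated image only where the sketch has strokes and leave the remaining pixels free. This is a hedged sketch of that idea, assuming `mask` is a binary stroke mask aligned with the sketch image.

```python
import torch

def masked_mse(pred, sketch, mask):
    # Penalize the generated image only at sketched pixels (mask == 1);
    # unmasked pixels stay unconstrained, so the generator is free to
    # hallucinate plausible cat features there.
    diff = (pred - sketch) ** 2 * mask
    return diff.sum() / mask.sum().clamp(min=1)
```

Denser sketches make `mask` cover more of the image, which is why their outputs track the sketch so closely.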

My sketches:

[Figure: my sketches and their outputs — a moderately dense sketch with given colors (good output after adjusting the perceptual-loss weight; poor output with the previous weights), a dense sketch with slightly deviated colors (good output with the same perceptual weight), and sparse sketches with arbitrary colors (good outputs after adjusting weights)]

Given sketches:

[Figure: given sketches 1.png through 5.png and their corresponding generated images]