16726 HW4 - Sean Chen (yuhsuan2)

0. Potential Bells and Whistles (5 pts)

For Style Transfer, besides using my own images as content, I also borrowed my artistic girlfriend's paintings as style images and was able to achieve desirable results (see Experiment 8).

1. Goal of the project

Implement style transfer from one picture to another, and learn the tips for tuning content/style layers.

2. Implementation

The main idea is to encode the images into latent feature representations and optimize the output image so that its features, and their correlations, match those of the content and style images.

In this assignment we directly use a pretrained VGG19 to encode the images into latent vectors. The task is therefore purely an optimization problem that involves no training.
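As a minimal sketch of that encoding step (the helper name and layer bookkeeping here are my own, assuming the standard torchvision VGG19):

    import torch
    import torchvision.models as models
    import torchvision.transforms as T

    # Pretrained VGG19 feature extractor; its weights stay frozen throughout.
    vgg = models.vgg19(pretrained=True).features.eval()
    for p in vgg.parameters():
        p.requires_grad_(False)

    # ImageNet normalization expected by the pretrained weights.
    normalize = T.Normalize(mean=[0.485, 0.456, 0.406],
                            std=[0.229, 0.224, 0.225])

    def extract_features(img, layers):
        """Collect activations at the given indices of vgg.features."""
        feats = {}
        x = normalize(img)
        for i, layer in enumerate(vgg):
            x = layer(x)
            if i in layers:
                feats[i] = x
        return feats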

3. Content Reconstruction

No style loss, just content loss.
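Concretely, the content loss is just a mean-squared error between the feature maps of the generated image and the content image at the chosen layer (a sketch using the extract_features helper above):

    import torch.nn.functional as F

    def content_loss(gen_feat, content_feat):
        # MSE between the generated and content images' activations
        # at the chosen content layer.
        return F.mse_loss(gen_feat, content_feat)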

Experiment 1: Optimizing content loss at different layers (15 points)

We can easily notice that the shallower the content layer (i.e., the smaller the sub-network we optimize through), the better the reconstruction, both in terms of convergence rate and image quality. I think this is because a deeper network is harder to optimize through: information is lost along the way and the optimization more easily gets stuck in local minima, which makes convergence harder.

Experiment 2: Choose your favorite layer (specify it on the website). Take two random noises as two input images and optimize them only with content loss. Please include your results on the website and compare them with each other and with the content image. (15 points)

I chose conv4 as the content layer. Since the two converged results look very similar, I subtracted the two images; the difference reveals that there are still many pixel-level discrepancies that may be too subtle for a human to perceive.
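The comparison itself is trivial (a sketch, assuming both reconstructions are tensors scaled to [0, 1]):

    def pixel_diff(result_a, result_b):
        """Absolute per-pixel difference between two reconstructions."""
        diff = (result_a - result_b).abs()
        return diff / diff.max()  # rescale so subtle differences become visible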

4. Texture Synthesis

No content loss, just style loss.
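The style loss compares Gram matrices of the feature maps rather than the feature maps themselves. A sketch, with the Gram matrix normalized over feature pixels (a choice that matters for the hyper-parameters, as Experiment 5 below notes):

    import torch
    import torch.nn.functional as F

    def gram_matrix(feat):
        # feat: (batch, channels, height, width)
        b, c, h, w = feat.shape
        f = feat.view(b, c, h * w)
        # Normalizing by c * h * w; skipping this shifts the best style
        # weight by several orders of magnitude (see Experiment 5).
        return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

    def style_loss(gen_feat, style_feat):
        return F.mse_loss(gram_matrix(gen_feat), gram_matrix(style_feat))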

Experiment 3: Report the effect of optimizing texture loss at different layers. (15 points)

Again, I use only one style layer at a time: the 1st, 2nd, 4th, 8th, and 16th layers, respectively.

I observed that at lower layers the textures are more geometric, rough, and shallow, while at higher layers the texture tends toward fine grain.

Experiment 4: Take two random noises as two input images and optimize them only with style loss. Please include your results on the website and compare these two synthesized textures. (15 points)

In this experiment I use only conv4 as the style layer. Unlike the result in Experiment 2, the images generated from different noises are completely different. However, I think it is safe to claim that they belong to the same texture.

5. Style Transfer

Include both content loss and style loss.

Experiment 5: Tune the hyper-parameters until you are satisfied. Pay special attention to whether your Gram matrix is normalized over feature pixels or not; this will change the appropriate hyper-parameters by 4-5 orders of magnitude. Please briefly describe your implementation details on the website. (10 points)

Following the suggestion, I use conv4 as the content layer and conv1 through conv5 as the style layers.

In Experiment 5 I initialize the generated image from the content image, and I found that the style weight works best when set higher than recommended; I settled on 10^7. However, the same parameters work terribly for random-noise inputs, for which I ultimately chose 10^5. This is understandable: when starting from the content image, the style image's influence is naturally much weaker, so a larger style weight is needed to balance it out.

The content weight was kept at 1. I use the L-BFGS optimizer and optimize the input image for 300 steps.
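A sketch of the optimization loop, reusing the helpers above (content_img, the layer index lists, and the precomputed target features content_feats / style_feats are assumed to be defined elsewhere; the weights shown are for the content-image initialization):

    import torch

    content_weight, style_weight = 1.0, 1e7  # 1e5 for random-noise input
    num_steps = 300

    # Optimize the pixels of the input image directly with L-BFGS.
    input_img = content_img.clone().requires_grad_(True)
    optimizer = torch.optim.LBFGS([input_img])

    step = [0]
    while step[0] < num_steps:
        def closure():
            optimizer.zero_grad()
            feats = extract_features(input_img.clamp(0, 1),
                                     content_layers + style_layers)
            loss = content_weight * sum(content_loss(feats[i], content_feats[i])
                                        for i in content_layers)
            loss = loss + style_weight * sum(style_loss(feats[i], style_feats[i])
                                             for i in style_layers)
            loss.backward()
            step[0] += 1
            return loss
        optimizer.step(closure)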

Here are my tuning results. The goal is to make an image that looks like the content image but carries the style of the style image.

When the input is random noise, the effect is more obvious.

Experiment 6: Please report at least a 2x2 grid of results optimized from two content images mixed with two style images accordingly. (10 points)

Experiment 7: Take a random noise and a content image as inputs, respectively. Compare the results in terms of quality and running time. (10 points)

As shown in the figure below, sometimes generating from the content image yields better results (e.g., Styles 1 and 3) and sometimes generating from noise does (e.g., Style 2). I think the reason lies in the nature of the input and how well it matches the textures of the content and style images. For example, since Style 2 has fewer blocks of solid color than Style 1, and its strokes are more random than the homogeneous ones of Style 3, it is easier for random noise to approach such a texture. Conversely, starting from a real image makes homogeneous texture styles transfer better.

The running time is similar (both take ~1 min on AWS).

Experiment 8: Try style transfer on some of your favorite images. (10 points)

First, I tried my favorite content images with the 3 styles:

I also wanted to try other styles. Thankfully my girlfriend is a good artist, and I was able to use her paintings to generate some novel styles too:

One notable thing is that the "sketch" style seems to fail more often than the others. I think this is because a black-and-white image carries two fewer channels of information than an RGB image, so its latent representation may be degraded. Another thought is that it consists of sharp curves that do not yield good gradients.