**16-726 Assignment 4:
Neural Style Transfer**
**Eileen Li (chenyil)**
Submission Date: April 2, 2023
Overview
===============================================================================

In this assignment we will implement neural style transfer, combining a content image with a style image such that we preserve the content of the former while adopting the style of the latter. In the first section, we start from a random input and optimize it in content space to reconstruct a content image. In the second section, we ignore content and optimize only to generate textures from the style image. In the third section, we combine the two for the final style transfer result, initializing the input image both from random noise and from the content image.

Training Details
-------------------------------------------------------------------------------

For Parts 1 and 2, I train for 300 steps locally on CPU with the L-BFGS optimizer (lr=1.0). For the final style transfer results in Part 3, I train on GPU.

Bells & Whistles
===============================================================================

In addition to optimizing an input image as required by the assignment, I also try using a feedforward network to directly 1) output the style transfer image and 2) synthesize texture from a particular style image. I also tried applying the trained MLPs to multiple content images, but the results were not very good.

Train MLP to output style transfer image directly [8 pts]
-------------------------------------------------------------------------------

Ref: [Perceptual Losses for Real-Time Style Transfer and Super-Resolution](https://cs.stanford.edu/people/jcjohns/papers/eccv16/JohnsonECCV16.pdf)
Inspired by the paper above, I use my `CycleGenerator` from Assignment 3 as the network. During training, I use the `Adam` optimizer to learn the weights of the network, minimizing the same content and style losses as in the assignment. Note that unlike the assignment, which optimizes the input image directly, here I can compute the style transfer image with a single forward pass: `output_img = mlp(content_img)`.
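Roughly, the training loop looks like the sketch below. This is not my exact code: `CycleGenerator` is the generator from Assignment 3 (not reproduced here), the learning rate and loss weights are illustrative, and `content_loss` / `style_loss` stand for the VGG-based losses sketched later in Parts 1 and 2.

```python
import torch

# Minimal sketch of the feedforward variant, not the exact assignment code.
# Assumptions: `CycleGenerator` comes from Assignment 3, and `content_loss` /
# `style_loss` are helpers computing the VGG-based losses from Parts 1 and 2.
content_img = torch.rand(1, 3, 256, 256)   # placeholder; in practice a normalized content photo
style_img = torch.rand(1, 3, 256, 256)     # placeholder; in practice the style painting
content_weight, style_weight = 1.0, 1e4    # illustrative values only

net = CycleGenerator()
optimizer = torch.optim.Adam(net.parameters(), lr=2e-4)

for step in range(5000):
    optimizer.zero_grad()
    output_img = net(content_img)          # style transfer is a single forward pass
    loss = (content_weight * content_loss(output_img, content_img)
            + style_weight * style_loss(output_img, style_img))
    loss.backward()
    optimizer.step()
```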
Below I share some results after 5000 steps of training:
content image: dancing
style image: frida_kahlo
mlp style transfer
Train MLP to synthesize texture [8 pts]
-------------------------------------------------------------------------------

I do something very similar to the above to synthesize texture with an MLP, except I only optimize `StyleLoss` and not `ContentLoss`. The `CycleGenerator` parameters are optimized for a particular style image, and I can generate the texture image by `output_img = mlp(noise_img)`.
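The corresponding sketch only swaps the input for a fixed noise image and drops the content term (same assumptions and helper names as in the previous sketch):

```python
# Variant of the previous sketch: the generator maps a fixed noise image to a
# texture, and only the style loss is optimized.
noise_img = torch.rand_like(style_img)

net = CycleGenerator()
optimizer = torch.optim.Adam(net.parameters(), lr=2e-4)

for step in range(5000):
    optimizer.zero_grad()
    output_img = net(noise_img)
    loss = style_weight * style_loss(output_img, style_img)
    loss.backward()
    optimizer.step()
```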
Below I share some results after 5000 steps of training:
style image: frida_kahlo
mlp texture synthesis
Part 1: Content Reconstruction
===============================================================================

The content loss is an L2 loss between features of the input image (initialized from noise) and features of the content image, extracted from the same layer of a pretrained VGG-19 network. We optimize the input image with respect to this loss. The VGG-19 network has 5 convolutional blocks in total. I experiment with using the first and second conv layers from each of these blocks to compute the content loss. The earliest conv layer can reconstruct almost perfectly, and the quality degrades the deeper we go, with the 5th block failing completely.
original wally
conv1_1
conv1_2
conv2_1
conv2_2
conv3_1
conv3_2
conv4_1
conv4_2
conv5_1
conv5_2
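For concreteness, here is a minimal sketch of the content loss described above. It is not my exact implementation; the layer index (21 for conv4_2) is an assumption about torchvision's VGG-19 numbering, used only for illustration.

```python
import torch.nn.functional as F
from torchvision import models

# Minimal sketch of the content loss, assuming torchvision's VGG-19 and that
# index 21 of `vgg` corresponds to conv4_2 (an indexing assumption).
vgg = models.vgg19(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def features_at(x, layer_idx):
    """Run x through the VGG-19 feature layers and return the chosen activation."""
    out = x
    for i, layer in enumerate(vgg):
        out = layer(out)
        if i == layer_idx:
            return out

def content_loss(input_img, content_img, layer_idx=21):
    # L2 (MSE) distance between feature maps at the chosen layer.
    return F.mse_loss(features_at(input_img, layer_idx),
                      features_at(content_img, layer_idx).detach())
```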
I want to preserve the semantic meaning of the image while allowing the finer details to be influenced by style transfer. From my results above (and from consulting the literature), I chose `conv4_2` as the layer from which to extract content features. Below, I use this layer and optimize two random noise initializations using only the content loss. They look pretty similar (albeit with some subtle differences), confirming that the reconstructed semantics or content do not change much across initializations.
original wally
optimized noise #1
optimized noise #2
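To close out this part, here is a minimal sketch of the optimization loop itself (L-BFGS directly on the input pixels, content loss only), reusing the `content_loss` helper sketched above; `content_img` stands for the loaded content image tensor, which is assumed rather than shown.

```python
import torch

# Minimal sketch of content reconstruction: optimize the input image's pixels.
# `content_img` is assumed to be a 1x3xHxW tensor prepared for VGG-19.
input_img = torch.rand_like(content_img, requires_grad=True)
optimizer = torch.optim.LBFGS([input_img], lr=1.0)

for step in range(300):
    def closure():
        with torch.no_grad():
            input_img.clamp_(0, 1)        # keep pixels in a valid range
        optimizer.zero_grad()
        loss = content_loss(input_img, content_img)
        loss.backward()
        return loss
    optimizer.step(closure)
```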
Part 2: Texture Synthesis
===============================================================================

Now we repeat the experiments, this time generating texture from a style image. For the style loss, we construct the Gram matrix of the features extracted from the selected layer(s), and optimize an input image (initialized from noise) to minimize the distance between its Gram matrices and those of the style image. The Gram matrix is computed by multiplying the flattened feature maps with their transpose, so it captures the correlations between feature channels rather than the locations of features (unlike the content loss).
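Here is a minimal sketch of the Gram matrix and a multi-layer style loss as described above, reusing the `features_at` helper from Part 1; the layer indices are again my own assumption about torchvision's VGG-19 numbering, not taken from the report.

```python
import torch.nn.functional as F

def gram_matrix(feat):
    # feat: (B, C, H, W) activations -> (B, C, C) channel correlations,
    # normalized by the number of entries in each feature map.
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def style_loss(input_img, style_img, layer_ids=(0, 5, 10, 19, 28)):
    # layer_ids roughly correspond to conv1_1 ... conv5_1 in torchvision's VGG-19.
    loss = 0.0
    for idx in layer_ids:
        g_input = gram_matrix(features_at(input_img, idx))
        g_style = gram_matrix(features_at(style_img, idx)).detach()
        loss = loss + F.mse_loss(g_input, g_style)
    return loss
```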
For these experiments, I found training to be much less stable than in `Part 1`, often leading to `nan` and requiring tuning of the `style_weight` hyperparameter. I tried `style_weight = {10, 100, 1000, ... 1M}` and kept the best results. Below are the results of using different conv layer(s) to generate texture. I experiment with single layers as well as ranges of layers (e.g. conv1_1~conv5_1).
Note: I later realized the training instability is eliminated on the GPU.
original starry_night
conv1_1
conv2_1
conv3_1
conv4_1
conv5_1
conv4_1~conv5_1
conv3_1~conv5_1
conv2_1~conv5_1
conv1_1~conv5_1
The earlier layers produce granular, repeating style patterns, while the middle layers produce larger, more fluid ones. The deepest layers most resemble noise. Combining multiple layers seems to give the best results. From the results above (and from consulting the literature), I select `conv1_1~conv5_1` for the style losses. Similar to `Part 1`, I synthesize texture from the same style image with two different noise initializations:
original starry_night
optimized noise #1
optimized noise #2
Unlike `Part 1`, there are obvious differences between the two initializations. They are similar in colors and in the size of the patterns, but differ substantially pixel by pixel. This makes sense because the loss is based not on semantic similarity but on feature correlations.

Part 3: Style Transfer
===============================================================================

I tried out different hyperparameters and settled on the combination below as giving the best qualitative results:
| Hyperparameter  | Value            |
| --------------- | ---------------- |
| content_layer   | conv4_2          |
| content_weight  | 1.0              |
| style_layer     | conv1_1~conv5_1  |
| style_weight    | {1000, 10000}    |
| initialization  | {noise, content} |
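Before going into the two settings, here is a minimal sketch of how the pieces fit together, reusing the `content_loss` / `style_loss` helpers sketched in Parts 1 and 2 (the function name and defaults below are mine, for illustration only):

```python
import torch

def run_style_transfer(content_img, style_img, init="noise",
                       content_weight=1.0, style_weight=1e3, num_steps=300):
    # Sketch only: combines the content and style losses from Parts 1 and 2.
    if init == "noise":
        input_img = torch.rand_like(content_img, requires_grad=True)
    else:                                      # initialize from the content image
        input_img = content_img.clone().requires_grad_(True)
    optimizer = torch.optim.LBFGS([input_img], lr=1.0)

    for step in range(num_steps):
        def closure():
            with torch.no_grad():
                input_img.clamp_(0, 1)
            optimizer.zero_grad()
            loss = (content_weight * content_loss(input_img, content_img)
                    + style_weight * style_loss(input_img, style_img))
            loss.backward()
            return loss
        optimizer.step(closure)

    return input_img.detach().clamp(0, 1)
```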
Setting 1: Noise Initialization
I explained my choice of layers in the previous parts. The `style_weight` is the main hyperparameter to tune: the higher it is, the more style influence we see, and vice versa. Below I show the results of varying `style_weight` with noise initialization, keeping the other hyperparameters from the table above. From these results, I pick `style_weight = 1000` when initializing from noise.
style_weight = 1
style_weight = 100
style_weight = 1000
style_weight = 10000
style_weight = 1000000
Below I show some results for noise initialization with the hyperparameters above:
Setting 2: Content Initialization
We can also initialize the input image from the content image. In this setting, we need much more style influence for a comparable result, so I increase `style_weight` to 10000. Unlike initializing from noise, initializing from content did not carry the danger of the content getting "washed out" by too large a `style_weight`. Personally, I prefer the quality of this setting. I also did not notice a significant difference in running time between the two settings. Below I show some results for content initialization with the hyperparameters above:
Below I show some more style transfer results, from my own photo album of my favorite beings: