Overview

This project generates an image that captures the content of one image while emulating the style of another. It is a style transfer process in which an initial noisy image is iteratively optimized to minimize a loss function defined over the intermediate feature maps of the noisy image and the target images. During the optimization process, the weights of the CNN are kept fixed, and the pixel values of the noisy image are updated iteratively to minimize the total loss.

Content reconstruction

Content reconstruction is used to generate an image that captures the content of a target content image. The content loss is typically defined as the mean squared error (MSE) between the feature maps of the noisy image and those of the target content image at a selected layer of the convolutional neural network (CNN). By minimizing the content loss, the generated image preserves the same high-level content as the target content image.
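As a minimal NumPy sketch, the content loss is just the MSE between two feature tensors. Here the "feature maps" are random arrays standing in for the activations of a chosen VGG19 layer, which is an assumption for illustration only:

```python
import numpy as np

def content_loss(noisy_feats, content_feats):
    """MSE between feature maps at one CNN layer.

    Both inputs are (channels, height, width) arrays; in the actual
    project they would come from a VGG19 layer.
    """
    return np.mean((noisy_feats - content_feats) ** 2)

# Toy "feature maps" standing in for real VGG19 activations
rng = np.random.default_rng(0)
f1 = rng.normal(size=(4, 8, 8))
f2 = rng.normal(size=(4, 8, 8))
```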

Texture reconstruction

Texture synthesis is used to generate an image that mimics the style of a target style image. The style loss is typically defined as the MSE between the Gram matrices of the feature maps of the input image and the style image at a selected layer of the CNN. The Gram matrix G^l of the feature map at layer l is defined as:

G^l_{ij} = Σ_k F^l_{ik} · F^l_{jk}

where F^l_{ik} is the activation of the i-th filter at position k of layer l.

By minimizing the style loss, the generated image adopts the same style as the style image.
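The Gram matrix and the resulting style loss can be sketched in NumPy as follows (random arrays stand in for the real layer activations):

```python
import numpy as np

def gram_matrix(feats):
    """Gram matrix of a (channels, height, width) feature map.

    Entry G[i, j] is the inner product between the flattened
    activations of channels i and j.
    """
    c, h, w = feats.shape
    flat = feats.reshape(c, h * w)
    return flat @ flat.T

def style_loss(noisy_feats, style_feats):
    """MSE between the Gram matrices of the two feature maps."""
    return np.mean((gram_matrix(noisy_feats) - gram_matrix(style_feats)) ** 2)

# Toy feature map in place of a real VGG19 activation
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4, 4))
```

Because the Gram matrix discards spatial positions (it only correlates channels), matching it reproduces texture statistics rather than layout, which is why it captures style.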

Neural Style transfer

In neural style transfer, an initial noisy image is iteratively optimized to minimize a loss function that is a combination of content loss and style loss. Let L_content be the content loss, L_style the style loss, λ_content the content loss weight, and λ_style the style loss weight. The neural style transfer loss is:

L_total = λ_content · L_content + λ_style · L_style

The final generated image combines the high-level content of the content image with the texture and style of the style image.
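The combined objective is a plain weighted sum; a one-line sketch (the default weights of 1 and 100 match the setting used in the experiments below):

```python
def total_loss(l_content, l_style, w_content=1.0, w_style=100.0):
    # Weighted sum of the two losses; the defaults mirror the
    # content weight 1 / style weight 100 setting used in this report.
    return w_content * l_content + w_style * l_style
```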

Experiments

I used the pre-trained VGG19 model as the CNN. I used the L-BFGS optimizer and set the number of steps to 500. In addition, I experimented with different settings, including different input images, different loss weights, different layers for applying the loss functions, and normalizing the Gram matrix.
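The optimization loop updates pixels while the network weights stay fixed. As a hedged, self-contained sketch, plain gradient descent stands in for L-BFGS and the "network" is the identity map, so the feature loss collapses to a pixel-wise MSE; the real project optimizes VGG19 features instead:

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.normal(size=(8, 8))    # stands in for the target features
image = rng.normal(size=(8, 8))     # noisy initialization

lr = 0.1
for _ in range(500):                # 500 steps, as in the experiments
    grad = 2 * (image - target)     # gradient of the squared error
    image -= lr * grad              # only pixels are updated
```

After the loop, the MSE between `image` and `target` is driven essentially to zero.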

Content reconstruction Results

I applied content loss to different conv layers of VGG19 and considered different combinations of layers. When content loss is applied at deeper layers, more of the input image's texture is discarded, and the output image gets closer to noise.

Layer Result Loss
Conv 1
Conv 2
Conv 3
Conv 4
Conv 5
Conv 1,2
Conv 1,2,3
Conv 1,2,3,4
Conv 1,2,3,4,5

I applied content loss at the deepest layer, conv5, because I wanted to capture the semantics of the content image and discard its texture as much as possible. However, this setting caused the loss value to become NaN for some of the images.
I then set the content loss at conv4 and experimented with different random initializations to check sensitivity to the input. The semantics of the two outputs are similar, but the textures differ.

Input Image Result
Random Image 1
Random Image 2

Texture reconstruction Results

I applied style loss to different conv layers of VGG19 and to different combinations of layers. When style loss is applied at deeper layers, coarser textures of the style image are synthesized and the finer textures disappear; thus, applying it only at deeper layers results in a noisier image. When the style losses of multiple layers (starting from the first layer) are combined, the output image contains the style image's texture at a variety of granularities.

Layer Result Loss
Conv 1
Conv 2
Conv 3
Conv 4
Conv 5
Conv 1,2
Conv 1,2,3
Conv 1,2,3,4
Conv 1,2,3,4,5

I applied style loss at the conv4 layer and experimented with different random initializations to check sensitivity to the input. The textures of the two outputs are similar, but the semantics differ.

Input Image Result
Random Image 1
Random Image 2

Style transfer Results

I experimented with different combinations of content and style loss layers and with different loss weights. I applied the content loss at conv4 to capture the semantics of the content image. Since different granularities of the style texture are captured by different layers, starting from the lowest, I applied style loss at conv1 through conv4. For the loss weights, I set the content loss weight to 1 and the style loss weight to 100.

Applying style loss at deeper layers leads to the style of the content image appearing more in the output image. In addition, initializing the input image with the content image leads to the best results.

Compared to random initialization, content initialization required a lower content loss weight and a higher style loss weight to achieve optimal results. This is because, at initialization, the input image already contains both the content and the style of the content image; a higher style loss weight is therefore needed to shift the input's style toward that of the style image.

With random initialization, the style image is better preserved in the output, whereas with content initialization, the content is better preserved.

Hyperparameter tuning

Layers and Gram Normalization

Normalizing the Gram matrix leads the output image to retain more of the content image's texture.
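Normalization here means dividing the Gram matrix by the number of elements in the feature map, which rebalances the style term against the content term; a small NumPy sketch of this assumed convention:

```python
import numpy as np

def gram_matrix(feats, normalize=True):
    # Gram matrix of a (channels, height, width) feature map. With
    # normalize=True it is divided by the number of elements, an
    # assumed convention that shrinks the style term's magnitude.
    c, h, w = feats.shape
    flat = feats.reshape(c, h * w)
    g = flat @ flat.T
    return g / (c * h * w) if normalize else g

rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4, 4))
```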

Content Layers Style Layers Result With Gram Normalization Result Without Gram Normalization
Conv 3 Conv 1
Conv 4 Conv 1
Conv 3 Conv 1, 2
Conv 4 Conv 1, 2
Conv 3 Conv 1, 2, 3
Conv 4 Conv 1, 2, 3
Conv 3 Conv 1, 2, 3, 4
Conv 4 Conv 1, 2, 3, 4
Conv 3 Conv 1, 2, 3, 4, 5
Conv 4 Conv 1, 2, 3, 4, 5


Combinations of Layers and Weights

Content Weight  Style Weight  Content: Conv 1 / Style: Conv 1  Content: Conv 1 / Style: Conv 5  Content: Conv 5 / Style: Conv 1  Content: Conv 5 / Style: Conv 5
1 1
1 2
1 5
1 10
1 50
1 100
2 100
5 100

Random vs Content as Input

Running times for the cases with a random image as input are almost the same. However, the run with the content image as input took longer than the runs with a random image as input.

Experiment Result Time
Random Image 1 as Input 14.5 Sec
Random Image 2 as Input 14.4 Sec
Content Image as Input 16.1 Sec


Final Results

Content \ Style


Extra Styles

Content \ Style

Bells & Whistles

For Bells & Whistles, I did the following:

  1. Style Transfer on Cats
  2. Style Transfer on Video

Style Transfer on Cats



Content \ Style


Content \ Style

Style Transfer on Video

I applied the style transfer to each frame of the video independently.

Video Style Result