This project generates an image that captures the content of one image while emulating the style of another. It is a style-transfer process in which an initial noisy image is iteratively optimized to minimize a loss defined between the intermediate feature maps of the noisy image and those of the target images. During optimization, the weights of the CNN are kept fixed, and only the pixel values of the noisy image are updated.
Content reconstruction is used to generate an image that captures the content of a target content image. The content loss is typically defined as the mean squared error (MSE) between the feature maps of the noisy image and those of the target content image at a selected layer of the convolutional neural network (CNN). By minimizing the content loss, the generated image preserves the same high-level content as the target content image.
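As an illustration of the definition above, the content loss can be sketched in a few lines of NumPy (the actual project computes this on VGG19 activations inside PyTorch; the toy arrays here stand in for feature maps):

```python
import numpy as np

def content_loss(features_generated, features_content):
    """MSE between the feature maps of the generated image and the
    content image at one chosen CNN layer."""
    diff = features_generated - features_content
    return np.mean(diff ** 2)

# Toy feature maps of shape (channels, height, width).
f_gen = np.zeros((4, 8, 8))
f_content = np.ones((4, 8, 8))
print(content_loss(f_gen, f_content))  # 1.0
```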
Texture synthesis is used to generate an image that mimics the style of a target style image. The style loss is typically defined as the MSE between the Gram matrices of the feature maps of the input image and the style image at a selected layer of the CNN. The Gram matrix G^l of the feature maps F^l at layer l is defined as:

G^l_ij = Σ_k F^l_ik · F^l_jk

where F^l_ik is the activation of the i-th filter at spatial position k in layer l.
By minimizing the style loss, the generated image adopts the same style as the style image.
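A minimal NumPy sketch of the Gram matrix and style loss as defined above (the project itself computes these on VGG19 feature maps in PyTorch):

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of a (channels, height, width) feature map:
    G[i, j] = sum over spatial positions k of F[i, k] * F[j, k]."""
    c, h, w = features.shape
    f = features.reshape(c, h * w)   # flatten spatial dimensions
    return f @ f.T                   # shape (channels, channels)

def style_loss(features_generated, features_style):
    """MSE between the Gram matrices at one chosen layer."""
    g_gen = gram_matrix(features_generated)
    g_style = gram_matrix(features_style)
    return np.mean((g_gen - g_style) ** 2)

f = np.random.rand(4, 8, 8)
g = gram_matrix(f)  # shape (4, 4), symmetric
```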
In neural style transfer, an initial noisy image is iteratively optimized to minimize a loss function that combines content loss and style loss. Let L_content be the content loss, L_style the style loss, λ_content the content loss weight, and λ_style the style loss weight. The neural style transfer loss is:

L = λ_content · L_content + λ_style · L_style
The final generated image combines the high-level content of the content image with the texture and style of the style image.
I used the pre-trained VGG19 model as the CNN, with the L-BFGS optimizer and 500 optimization steps. In addition, I experimented with different settings: different input images, different loss weights, different layers for applying the loss functions, and normalizing the Gram matrix.
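The pixels-are-the-parameters idea can be sketched with SciPy's L-BFGS-B on a toy quadratic "feature" loss. This is only a stand-in: the real project uses PyTorch's `torch.optim.LBFGS` with frozen VGG19 features, while here a fixed random matrix plays the role of a frozen CNN layer, so the sketch shows that the image, not the network, is what gets optimized:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 64))   # fixed "CNN layer" (weights frozen)
target_image = rng.random(64)
target_features = W @ target_image  # features of the content image

def loss_and_grad(x):
    # Content loss: MSE between features of x and the target features.
    diff = W @ x - target_features
    loss = np.mean(diff ** 2)
    grad = 2.0 / diff.size * (W.T @ diff)  # gradient w.r.t. pixels only
    return loss, grad

x0 = rng.random(64)  # noisy initial "image"
res = minimize(loss_and_grad, x0, jac=True, method="L-BFGS-B",
               options={"maxiter": 500})
print(res.fun)  # close to 0: the features now match the target
```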
I applied content loss to different conv layers of VGG19 and considered different combinations of layers.
By applying content loss at deeper layers, more of the input image's texture is discarded, and the output image gets closer to noise.
| Layer | Result | Loss |
|---|---|---|
| Conv 1 | | |
| Conv 2 | | |
| Conv 3 | | |
| Conv 4 | | |
| Conv 5 | | |
| Conv 1,2 | | |
| Conv 1,2,3 | | |
| Conv 1,2,3,4 | | |
| Conv 1,2,3,4,5 | | |
I applied content loss at the deepest layer, conv5, because I wanted to capture the semantics of the content image and discard its texture as much as possible. However, this setting resulted in the loss value becoming NaN for some of the images.
I applied content loss at conv4 and experimented with different random initializations to check sensitivity to the input. The semantics of the two outputs are similar, but the textures differ.
| Input Image | Result |
|---|---|
| Random Image 1 | |
| Random Image 2 | |
I applied style loss to different conv layers of VGG19 and to different combinations of layers.
By applying style loss at deeper layers, coarser textures of the style image are synthesized, and the finer textures disappear. Thus, applying it only at deeper layers results in a noisier image. When the style losses of multiple layers (starting from the first layer) are combined, the resulting image has a variety of granularities of the style image's texture.
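Combining per-layer style losses can be sketched as a sum over the chosen layers; this NumPy version assumes a dict mapping layer names to feature maps, standing in for VGG19 activations:

```python
import numpy as np

def gram(features):
    c = features.shape[0]
    f = features.reshape(c, -1)
    return f @ f.T

def multi_layer_style_loss(gen_feats, style_feats, layers):
    """Sum the Gram-matrix MSE over the selected layers, so the output
    picks up both coarse and fine granularities of the style texture."""
    total = 0.0
    for layer in layers:
        total += np.mean((gram(gen_feats[layer]) - gram(style_feats[layer])) ** 2)
    return total

# Toy per-layer feature maps standing in for conv1..conv5 activations.
feats = {f"conv{i}": np.random.rand(4, 8, 8) for i in range(1, 6)}
```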
| Layer | Result | Loss |
|---|---|---|
| Conv 1 | | |
| Conv 2 | | |
| Conv 3 | | |
| Conv 4 | | |
| Conv 5 | | |
| Conv 1,2 | | |
| Conv 1,2,3 | | |
| Conv 1,2,3,4 | | |
| Conv 1,2,3,4,5 | | |
I applied style loss at the conv4 layer and experimented with different random initializations to check sensitivity to the input. The textures of the two outputs are similar, but the semantics differ.
| Input Image | Result |
|---|---|
| Random Image 1 | |
| Random Image 2 | |
I experimented with different combinations of applying content loss and style loss and with different loss weights. I applied the content loss to conv4 to capture the semantics of the content image. On the other hand, since different texture granularities of the style are captured by different layers starting from the lowest one, I applied style loss to conv1,2,3,4.
In addition, for the loss weights, I set the content loss weight to 1 and the style loss weight to 100.
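Putting the pieces together, the total objective described above (content loss at conv4, style loss at conv1-4, weights 1 and 100) might be assembled like this sketch; the feature extraction itself is elided, and the dicts stand in for VGG19 activations:

```python
import numpy as np

def mse(a, b):
    return np.mean((a - b) ** 2)

def gram(f):
    c = f.shape[0]
    return f.reshape(c, -1) @ f.reshape(c, -1).T

def total_loss(gen, content, style,
               content_layers=("conv4",),
               style_layers=("conv1", "conv2", "conv3", "conv4"),
               content_weight=1.0, style_weight=100.0):
    """Weighted sum: content term at conv4, style terms at conv1-4."""
    l_content = sum(mse(gen[l], content[l]) for l in content_layers)
    l_style = sum(mse(gram(gen[l]), gram(style[l])) for l in style_layers)
    return content_weight * l_content + style_weight * l_style

layers = {f"conv{i}": np.random.rand(3, 6, 6) for i in range(1, 5)}
# When generated, content, and style features all coincide, every term vanishes.
assert total_loss(layers, layers, layers) == 0.0
```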
Applying style loss to deeper layers leads to the style of the content image appearing more in the output image. In addition, initializing the input image with the content image leads to the best results.
Compared to random initialization, content initialization achieved optimal results with a lower content loss weight and a higher style loss weight. This is because the input image already contains both the content and the style of the content image at initialization; thus, a larger style loss weight is needed to shift its style toward that of the style image.
With random initialization, the style image is better preserved in the output, while with content initialization, the content is better preserved.
Normalizing the Gram matrix leads the output image to retain more of the content image's texture.
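Normalizing the Gram matrix divides out the feature-map size, which shrinks the style term relative to the content term. One common normalization (an assumption here, since variants exist) divides by the number of channels times the number of spatial positions:

```python
import numpy as np

def gram(features, normalize=False):
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    g = f @ f.T
    if normalize:
        g = g / (c * h * w)  # divide by #channels * #spatial positions
    return g

f = np.ones((2, 4, 4))
print(gram(f)[0, 0])                  # 16.0: sum over 16 spatial positions
print(gram(f, normalize=True)[0, 0])  # 0.5 = 16 / (2 * 16)
```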
| Content Layers | Style Layers | Result With Gram Normalization | Result Without Gram Normalization |
|---|---|---|---|
| Conv 3 | Conv 1 | | |
| Conv 4 | Conv 1 | | |
| Conv 3 | Conv 1, 2 | | |
| Conv 4 | Conv 1, 2 | | |
| Conv 3 | Conv 1, 2, 3 | | |
| Conv 4 | Conv 1, 2, 3 | | |
| Conv 3 | Conv 1, 2, 3, 4 | | |
| Conv 4 | Conv 1, 2, 3, 4 | | |
| Conv 3 | Conv 1, 2, 3, 4, 5 | | |
| Conv 4 | Conv 1, 2, 3, 4, 5 | | |
| Content Weight | Style Weight | Content: Conv 1, Style: Conv 1 | Content: Conv 1, Style: Conv 5 | Content: Conv 5, Style: Conv 1 | Content: Conv 5, Style: Conv 5 |
|---|---|---|---|---|---|
| 1 | 1 | | | | |
| 1 | 2 | | | | |
| 1 | 5 | | | | |
| 1 | 10 | | | | |
| 1 | 50 | | | | |
| 1 | 100 | | | | |
| 2 | 100 | | | | |
| 5 | 100 | | | | |
Running times for the cases with a random image as input are almost the same. However, the running time with the content image as input is longer than with a random image.
| Experiment | Result | Time |
|---|---|---|
| Random Image 1 as Input | | 14.5 sec |
| Random Image 2 as Input | | 14.4 sec |
| Content Image as Input | | 16.1 sec |
| Content \ Style | | | | | |
|---|---|---|---|---|---|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| Content \ Style | | | |
|---|---|---|---|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
For Bells & Whistles, I did the following:
| Content \ Style | | | |
|---|---|---|---|
| | | | |
| | | | |
| | | | |
| Content \ Style | | | |
|---|---|---|---|
| | | | |
| | | | |
| | | | |
I applied style transfer to each frame of the video.
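Frame-by-frame stylization can be sketched as a simple loop. Here `stylize` is a hypothetical stand-in for one full style-transfer optimization (a placeholder blend, not the real transfer), and video I/O (e.g. with OpenCV) is abstracted into plain lists of frame arrays:

```python
import numpy as np

def stylize(frame, style_image):
    """Stand-in for one full style-transfer optimization on a frame."""
    return 0.5 * frame + 0.5 * style_image  # placeholder blend, not real transfer

def stylize_video(frames, style_image):
    # Independent per-frame transfer; there is no temporal-consistency
    # term, so some flicker between consecutive frames is expected.
    return [stylize(f, style_image) for f in frames]

frames = [np.random.rand(8, 8, 3) for _ in range(3)]
style = np.random.rand(8, 8, 3)
out = stylize_video(frames, style)
```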
| Video | Style | Result |
|---|---|---|
| | | |