Assignment #4 - Neural Style Transfer

Project Overview

Neural Style Transfer is an exciting field at the intersection of computer vision and artistic expression, leveraging deep learning to blend the content of one image with the style of another. This project explores the implementation and optimization of neural style transfer techniques, allowing for the creation of unique, stylistically transformed images. The primary objective is to develop a deep understanding of how neural networks can be used to analyze and blend artistic styles onto content images. We implement algorithms that optimize an input image so that it reflects the content of one image and the style of another, using content and style losses computed from intermediate CNN features to guide the transformation.

Neural Style Transfer involves the optimization of an image to match the content of one image and the style of another. The process is facilitated by a convolutional neural network (CNN), often the VGG network, due to its robustness in capturing image features at various levels of abstraction. The core of the algorithm involves two primary loss functions: content loss and style loss.

VGG Network

The VGG network, specifically VGG-19, is commonly used as a feature extractor in computer vision. The network consists of multiple convolutional layers grouped into 5 blocks, each capturing features at a different level of abstraction. For Neural Style Transfer, we insert the loss computation after a chosen convolution operation in a block (before that block's ReLU and max-pooling operations) and keep the VGG-19 network itself frozen, so that only the input image is updated.

VGG-19
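As a minimal sketch of this setup, assuming PyTorch and torchvision (the framework is not mandated by the text above), the VGG-19 feature extractor can be loaded and frozen like so:

```python
import torch
import torchvision.models as models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Convolutional part of pre-trained VGG-19 in eval mode
# (newer torchvision uses weights=models.VGG19_Weights.DEFAULT instead of pretrained=True).
vgg = models.vgg19(pretrained=True).features.to(device).eval()

# Freeze all weights: only the input image is optimized.
for param in vgg.parameters():
    param.requires_grad_(False)
```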

Content Loss

The content loss ensures the generated image $G$ retains the content of the content image $C$. It is defined at a layer $l$ of the network as the Mean Squared Error (MSE) between the feature maps $F^l_C$ and $F^l_G$ of the content and generated images, respectively:

$$\mathcal{L}_{content}^{l} = \frac{1}{2} \sum_{i,j} \left( F^{l}_{C,ij} - F^{l}_{G,ij} \right)^{2}$$
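A sketch of the content loss as a pass-through PyTorch module, following the common tutorial pattern (the class name and pass-through design are conventions we assume, not a required API). Note that `F.mse_loss` averages over elements rather than summing, so the 1/2 factor above is absorbed into the content weight:

```python
import torch.nn as nn
import torch.nn.functional as F

class ContentLoss(nn.Module):
    """Stores fixed content-image features and records the MSE to them on each forward pass."""
    def __init__(self, target):
        super().__init__()
        self.target = target.detach()  # the content features are a constant target

    def forward(self, x):
        self.loss = F.mse_loss(x, self.target)
        return x  # pass features through unchanged so the module can sit inside VGG
```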

Style Loss

The style loss ensures the generated image mimics the style of the style image $S$. It is based on the Gram matrices $G^l_S$ and $G^l_G$ of the style and generated images, respectively. The Gram matrix is a measure of the correlation between different feature maps at layer $l$. The style loss at layer $l$ is defined as the MSE between the Gram matrices:

$$\mathcal{L}_{style}^{l} = \frac{1}{4 N_l^{2} M_l^{2}} \sum_{i,j} \left( G^{l}_{S,ij} - G^{l}_{G,ij} \right)^{2}$$

where $N_l$ is the number of feature maps and $M_l$ is the size of each feature map at layer $l$.
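Continuing the sketch above, the Gram matrix and style loss can be written as follows; normalizing the Gram matrix by its size plays the role of the $1/(4 N_l^2 M_l^2)$ factor up to a constant that folds into the style weight:

```python
def gram_matrix(x):
    # x: (batch, channels, height, width) feature maps at layer l
    b, c, h, w = x.shape               # c corresponds to N_l; h * w corresponds to M_l
    feats = x.view(b * c, h * w)
    gram = feats @ feats.t()           # correlations between pairs of feature maps
    return gram / (b * c * h * w)      # normalize so shallow and deep layers are comparable

class StyleLoss(nn.Module):
    """Stores the style image's Gram matrix and records the MSE to it on each forward pass."""
    def __init__(self, target_features):
        super().__init__()
        self.target = gram_matrix(target_features).detach()

    def forward(self, x):
        self.loss = F.mse_loss(gram_matrix(x), self.target)
        return x
```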

Total Loss

The total loss is a weighted combination of the content loss and style loss across selected layers. The weights α and β control the relative importance of content and style, respectively:

$$\mathcal{L}_{total} = \alpha \sum_{l} \mathcal{L}_{content}^{l} + \beta \sum_{l} \mathcal{L}_{style}^{l}$$

Optimization

Starting with an initial image (either random noise or the content image), the algorithm iteratively updates the image to minimize the total loss. The optimization is typically performed using backpropagation and an optimizer such as L-BFGS.
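A sketch of the optimization loop with L-BFGS, which in PyTorch requires a closure that re-evaluates the loss. The names `model`, `content_losses`, `style_losses`, `alpha`, and `beta` refer to the hypothetical pieces sketched above and in Part 1, not to a fixed interface:

```python
input_img = content_img.clone()            # or torch.rand_like(content_img) for a noise start
input_img.requires_grad_(True)
optimizer = torch.optim.LBFGS([input_img])
alpha, beta = 1.0, 1e6                     # illustrative weights; tuned per experiment in Part 3

run = [0]                                  # mutable step counter visible inside the closure
while run[0] <= 300:
    def closure():
        with torch.no_grad():
            input_img.clamp_(0, 1)         # keep pixel values in a valid range
        optimizer.zero_grad()
        model(input_img)                   # forward pass fills in .loss on each loss module
        loss = (alpha * sum(cl.loss for cl in content_losses)
                + beta * sum(sl.loss for sl in style_losses))
        loss.backward()
        run[0] += 1
        return loss
    optimizer.step(closure)

with torch.no_grad():
    input_img.clamp_(0, 1)                 # final clamp before saving the image
```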

By carefully balancing the content and style losses, Neural Style Transfer generates images that blend the content of one image with the artistic style of another, showcasing the power of CNNs in understanding and manipulating image content at various levels of abstraction.

 

Part 1: Content Reconstruction

1. Content loss at different layers

We show the results when applying the content loss after convolution blocks 1 through 5 separately.
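Before the results, here is a sketch of how a content-loss module can be spliced into a copy of the frozen VGG-19 right after a chosen convolution. The sequential conv numbering is an assumption based on torchvision's VGG-19 layout, where `conv_8` would be the last convolution of block 3:

```python
import copy

def build_content_model(vgg, content_img, content_layer="conv_8"):
    """Copy VGG-19 layers into a new nn.Sequential, inserting a ContentLoss
    right after the named convolution (before its ReLU), then truncating."""
    model = nn.Sequential()
    content_losses = []
    conv_idx = 0
    for layer in copy.deepcopy(vgg).children():
        if isinstance(layer, nn.Conv2d):
            conv_idx += 1
            name = f"conv_{conv_idx}"
        elif isinstance(layer, nn.ReLU):
            name = f"relu_{conv_idx}"
            layer = nn.ReLU(inplace=False)   # in-place ReLU would overwrite stored activations
        elif isinstance(layer, nn.MaxPool2d):
            name = f"pool_{conv_idx}"
        else:
            name = f"layer_{conv_idx}"
        model.add_module(name, layer)
        if name == content_layer:
            target = model(content_img).detach()   # content-image features at this layer
            loss_module = ContentLoss(target)
            model.add_module("content_loss", loss_module)
            content_losses.append(loss_module)
            break                                  # layers past the loss are never needed
    return model, content_losses
```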

After convolution block1:

reconstructed_image_conv_1_2

After convolution block2:

reconstructed_image_conv_2_2

After convolution block3:

reconstructed_image_conv_3_4

After convolution block4:

reconstructed_image_conv_4_4

After convolution block5:

reconstructed_image_conv_5_4

Based on the results, we find that when reconstructing from the features of blocks 1-3, the reconstructions are good and the shapes of the dogs can be recognized. However, when reconstructing from the features of blocks 4 and 5, the results become noisy: the block 4 reconstruction still preserves recognizable shape information, while the block 5 reconstruction is almost random. Therefore, my favorite reconstruction is the one from block 3.

2. Random noise as input

I take convolution block 3 as the position to add the content loss. With two random-noise initializations, we find that the reconstructed images differ only slightly (the dog's left eye takes a different shape) and have different distributions of noisy points.
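A sketch of the seeded setup (`run_reconstruction` is a hypothetical stand-in for the content-only optimization loop described above):

```python
results = {}
for seed in (0, 1):
    torch.manual_seed(seed)
    input_img = torch.rand_like(content_img)       # noise with the content image's shape
    results[seed] = run_reconstruction(input_img)  # hypothetical content-only optimization
```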

reconstruction results from seed 0:

reconstructed_image_seed_0

reconstruction results from seed 1:

reconstructed_image_seed_1

Part 2: Texture Synthesis

1. Style loss at different layers

We show the results when applying the style loss after convolution blocks 1 through 5 separately.

After block1:

synthesized_texture_conv_1_2

After block2:

synthesized_texture_conv_2_2

After block3:

synthesized_texture_conv_3_4

After block4:

synthesized_texture_conv_4_4

After block5:

synthesized_texture_conv_5_4

Based on the texture synthesis results, we find that none of the synthesized textures is very similar to the original style. From convolution block 1 to block 5, the textures become less and less similar and more and more noisy. Notably, the synthesis results from all five blocks contain noisy points in the generated image. I like the result based on convolution block 2 the best.

2. Random noise as input

I take convolution block 2 as the position to add the style loss. With two random-noise initializations, we find that different initializations produce markedly different synthesized results: the shapes of the textures are clearly different, while the color and overall style of the two synthesized textures are the same.

Texture synthesis results from seed 0:

synthesized_image_seed_0

Texture synthesis results from seed 1:

synthesized_image_seed_1

Part 3: Style Transfer

1. Hyper-parameter tuning

1.a random noise initialization

content loss position (from conv block 1 to block 5)

Image 1 Image 2 Image 3 Image 4 Image 5

style loss position (from conv block 1 to block 5)

Image 1 Image 2 Image 3 Image 4 Image 5

content loss weight (0.01, 1, 100)

Image 1 Image 2 Image 3

style loss weight (1e5, 1e6, 1e7)

Image 1 Image 2 Image 3

Based on the ablation study above, the implementation details for random-noise initialization are:

content loss position = block 3

style loss position = blocks 1-5

content loss weight = 100

style loss weight = 1e6

1.b content image initialization

content loss position (from conv block 1 to block 5)

Image 1 Image 2 Image 3 Image 4 Image 5

style loss position (from conv block 1 to block 5)

Image 1 Image 2 Image 3 Image 4 Image 5

content loss weight (0.01, 1, 100)

Image 1 Image 2 Image 3

style loss weight (1e5, 1e6, 1e7)

Image 1 Image 2 Image 3

Based on the ablation study above, the implementation details for content image initialization are:

content loss position = block 3

style loss position = blocks 1-5

content loss weight = 10

style loss weight = 1e7
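To summarize, the two chosen settings can be written as plain configuration dictionaries (the key names are illustrative only, not a fixed interface):

```python
best_configs = {
    "random_noise_init": dict(
        content_layer="conv_block_3",
        style_layers=["conv_block_1", "conv_block_2", "conv_block_3",
                      "conv_block_4", "conv_block_5"],
        content_weight=100,
        style_weight=1e6,
    ),
    "content_image_init": dict(
        content_layer="conv_block_3",
        style_layers=["conv_block_1", "conv_block_2", "conv_block_3",
                      "conv_block_4", "conv_block_5"],
        content_weight=10,
        style_weight=1e7,
    ),
}
```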

2. Grid results

2.a random noise initialization

Image 1 Image 2 Image 3
Image 4 Image 5 Image 6
Image 8 Image 9

2.b content image initialization

Image 1 Image 2 Image 3
Image 4 Image 5 Image 6
Image 8 Image 9

3. Random noise and content image as input

In terms of quality, comparing the results above for random-noise and content-image initialization, the results that start from the content image are less noisy and have more concrete content than those that start from random noise.

In terms of time, starting from random noise takes 20.02 s of wall-clock time to finish, while starting from the content image takes 18.60 s, indicating that starting from the content image is faster.

4. Favorite image style transfer

For my favorite style transfer, I render La Mort de Marat in a pointillist style.

Style image:

pointillism

Content image:

marat

Style-transferred image:

favorite_style_transfer