In this project, we aim to transfer styles from a style image to a content image by optimizing with respect to a content loss and a style loss. The content loss is defined as the squared L2 distance between the deep features of the content image and those of the optimized input image. The style loss is defined as the distance between the Gram matrices of the deep features of the style image and those of the optimized input image. By optimizing the input image with respect to the combined loss, we are able to transfer styles from the style image to the content image.
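As a sketch of these definitions, the losses can be written as follows. The notation is assumed rather than taken from the original write-up: $x$ is the optimized input image, $c$ the content image, $s$ the style image, $F^l(\cdot)$ the feature map of layer $l$ reshaped to a (channels $\times$ positions) matrix, $l_c$ the content layer, $\mathcal{S}$ the set of style layers, and $w_c$, $w_s$ the loss weights. The squared Frobenius norm in the style loss is also an assumption (a common choice); the text above only says "distance".

$$
\begin{aligned}
L_{content}(x) &= \lVert F^{l_c}(x) - F^{l_c}(c) \rVert_2^2 \\
G^l(\cdot) &= F^l(\cdot)\, F^l(\cdot)^\top \\
L_{style}(x) &= \sum_{l \in \mathcal{S}} \lVert G^l(x) - G^l(s) \rVert_F^2 \\
L(x) &= w_c\, L_{content}(x) + w_s\, L_{style}(x)
\end{aligned}
$$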
Four sample results using conv_1, conv_4, conv_7, and conv_10 as the content feature are shown below. Using shallow features (conv_1 and conv_4) clearly yields more realistic results with fewer artifacts. Because conv_1 is very low level (close to raw pixel values), conv_4 is used as the content feature in the final model.
The images generated from two different random noise initializations are visually similar overall. However, a closer look at high-frequency areas reveals some noise-like differences (see Pixel difference (detail)).
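One way to produce such a pixel-difference map is sketched below. This is an assumption about how the comparison could be computed, not necessarily how the figure was made; the function name `pixel_difference` and the (H, W, C) layout with values in [0, 1] are hypothetical choices for illustration.

```python
import numpy as np

def pixel_difference(img_a, img_b):
    """Per-pixel absolute difference, averaged over channels.

    Both inputs are assumed to be (H, W, C) arrays with values in [0, 1].
    Returns an (H, W) map; larger values mean the two outputs differ more.
    """
    a = np.asarray(img_a, dtype=np.float64)
    b = np.asarray(img_b, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError("images must have the same shape")
    return np.abs(a - b).mean(axis=-1)

# Toy example: two tiny images that differ at a single pixel.
out_a = np.zeros((4, 4, 3))
out_b = out_a.copy()
out_b[1, 2] = 0.3
diff_map = pixel_difference(out_a, out_b)
```

Scaling `diff_map` before display (for example dividing by its maximum) makes small high-frequency differences easier to see.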
Three sample texture synthesis results are shown below, using shallow features (conv_1~conv_5), mid-level features (conv_6~conv_10), and deep features (conv_11~conv_15). The shallow sample clearly shows a more realistic texture than the other two. Therefore, the shallow features (conv_1~conv_5) are used in the final model.
Two sample texture synthesis results from two different random noise initializations are shown below. The results clearly differ, but both look good and capture the general texture.
For loss computation, we use conv_4 as the content feature and conv_1~conv_5 as the style features. The Gram matrix is normalized by the number of elements in the feature maps. The style weight is set to $10^6$ and the content weight is set to 1. We use the L-BFGS optimizer and optimize the input image for 300 steps.
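The loss computation described here can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the function names are hypothetical, the feature maps are random stand-ins for network activations, and the squared distance between Gram matrices is an assumed choice of "distance". Only the normalization by the number of feature-map elements and the two weights come from the text.

```python
import numpy as np

STYLE_WEIGHT = 1e6    # style weight stated in the text
CONTENT_WEIGHT = 1.0  # content weight stated in the text

def gram_matrix(features):
    """Gram matrix of a (C, H, W) feature map, normalized by its element count."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)      # one row per channel
    return flat @ flat.T / features.size   # (C, C), divided by C*H*W

def content_loss(input_feat, content_feat):
    """Squared L2 distance between two feature maps."""
    return float(np.sum((input_feat - content_feat) ** 2))

def style_loss(input_feats, style_feats):
    """Sum over layers of the squared distance between Gram matrices (assumed form)."""
    return float(sum(np.sum((gram_matrix(a) - gram_matrix(b)) ** 2)
                     for a, b in zip(input_feats, style_feats)))

def total_loss(input_content_feat, content_feat, input_style_feats, style_feats):
    """Weighted sum of the content and style losses."""
    return (CONTENT_WEIGHT * content_loss(input_content_feat, content_feat)
            + STYLE_WEIGHT * style_loss(input_style_feats, style_feats))

# Random stand-ins for activations at two hypothetical layers.
rng = np.random.default_rng(0)
shapes = [(8, 16, 16), (16, 8, 8)]
input_feats = [rng.standard_normal(s) for s in shapes]
style_feats = [rng.standard_normal(s) for s in shapes]
content_feat = rng.standard_normal(shapes[1])

print(total_loss(input_feats[1], content_feat, input_feats, style_feats))
```

In the real pipeline, the value returned by `total_loss` would be minimized with respect to the input image, so the features would come from a differentiable network rather than fixed arrays.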
As shown in the figure below, generating from the content image sometimes yields better results (e.g. Style 3), and generating from noise sometimes yields better results (e.g. Style 2). The running time is similar (both ~4 min on my machine at an input resolution of $256\times 256$).
Based on the method proposed in [1], we can get fast style transfer results by using a pretrained feed-forward transformation network (FFN) to perform the style transfer. With this FFN, we can remove the iterative optimization process during style transfer.
In our experiments, we follow the method described in [1]. We train the network with the same style and content losses on the ImageNet train split for 100K iterations. The results are shown below. We used The Scream as the style image. Note that the content and style of the outputs are consistent with the respective images. Furthermore, we find that the running time is reduced by about $98\%$, from 238 s to ~5 s, compared to the naive optimization method.
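The size of this reduction can be checked from the two timings reported above. The short sketch below uses only those figures (238 s and roughly 5 s); the helper name `runtime_reduction` is hypothetical. Because the 5 s figure is approximate, the result should be read as a rough estimate.

```python
def runtime_reduction(baseline_s, fast_s):
    """Fraction of the baseline running time that is saved."""
    return 1.0 - fast_s / baseline_s

baseline_s = 238.0  # per-image optimization time reported above
fast_s = 5.0        # approximate feed-forward time reported above

reduction = runtime_reduction(baseline_s, fast_s)
speedup = baseline_s / fast_s

print(f"{reduction:.1%} less time, {speedup:.1f}x faster")
# prints "97.9% less time, 47.6x faster"
```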
[1]: Perceptual Losses for Real-Time Style Transfer and Super-Resolution (Johnson et al., 2016)
Now that we have a faster processing method, we can scale up the number of images processed by applying style transfer to videos. The results are shown below. Note that this video would take a long time (several hours) to process if we ran the optimization for every single frame.