16726 Project Report

Photo-realistic Style Transfer

Jingxiang Lin (jingxia2)

Introduction

Style transfer is the task of transferring the style of an image to a target image while keeping the content of the target image. Leveraging recent advances in deep learning, a neural-network-based approach was first proposed in [1] and has shown impressive results. Although this method can produce good artistic-looking images, it cannot generate images that look realistic. An example is shown below: we can see that there are light textures on the sky and that the windows of the buildings are distorted in the image.

Image generated by Neural Style Transfer

To fix this issue, photo-realistic style transfer was proposed [2]. It builds upon the previous method and uses an augmented style loss and a photo-realistic regularization term to generate more realistic-looking images. In this project, I reimplemented this method and did experiments and analysis on my reimplementation.

Approach

Neural Style Transfer

The Neural Style Transfer [1] method uses a VGG-19 network pretrained on ImageNet as the feature extractor to extract the style and content images’ information at different levels. It then jointly optimizes a content loss and a style loss to make the generated image have the style image’s style while keeping the content image’s content. The equations are shown below.
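
Following the notation of [2], the per-layer losses from [1] can be written as

    L_c^l = \frac{1}{2 N_l D_l} \sum_{ij} \left( F_l[O] - F_l[I] \right)_{ij}^2

    L_s^l = \frac{1}{2 N_l^2} \sum_{ij} \left( G_l[O] - G_l[S] \right)_{ij}^2

where O, I and S denote the output, content and style images, and N_l and D_l are the number of filters and the number of spatial positions at layer l.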

Here L_c^l is the content loss for the l-th layer, and L_s^l is the style loss for the l-th layer. F_l[·] is the feature map extracted at the l-th layer, and G_l is the Gram matrix for the l-th layer computed from F_l[·].
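
As a minimal sketch, the per-layer Gram matrix and losses can be computed in PyTorch roughly as follows; the feat tensors here are assumed to be (channels, height, width) feature maps taken from VGG-19:

    import torch

    def gram_matrix(feat):
        # feat: (C, H, W) feature map from one VGG-19 layer
        c, h, w = feat.shape
        f = feat.reshape(c, h * w)   # F_l with N_l = c filters and D_l = h * w positions
        return f @ f.t()             # G_l = F_l F_l^T

    def content_loss(feat_out, feat_content):
        c, h, w = feat_out.shape
        return ((feat_out - feat_content) ** 2).sum() / (2 * c * h * w)

    def style_loss(feat_out, feat_style):
        c = feat_out.shape[0]
        g_o, g_s = gram_matrix(feat_out), gram_matrix(feat_style)
        return ((g_o - g_s) ** 2).sum() / (2 * c ** 2)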

Photo-realistic Style Transfer

The photo-realistic style transfer paper [2] proposed two improvements on the original neural style transfer method to make the generated image photo-realistic.

The first improvement is the augmented style loss; the idea is that we only transfer the style of an object in the style image to the corresponding object in the content image. This is achieved by first running semantic segmentation on both the content image and the style image, and then applying the masks to the extracted features before computing the Gram matrices. Formally, the augmented style loss is formulated in the equations below.
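
Following [2], the augmented style loss for layer l is

    L_{s+}^l = \sum_{c=1}^{C} \frac{1}{2 N_{l,c}^2} \sum_{ij} \left( G_{l,c}[O] - G_{l,c}[S] \right)_{ij}^2

    F_{l,c}[O] = F_l[O] \, M_{l,c}[I], \qquad F_{l,c}[S] = F_l[S] \, M_{l,c}[S]

where G_{l,c}[\cdot] is the Gram matrix computed from the masked features F_{l,c}[\cdot].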

Here C is the total number of semantic masks, and M_{l,c}[·] is the c-th semantic mask at layer l.

The second improvement is the photo-realistic regularization; the motivation is that the generated image should not have unrealistic distortions like the window distortions we saw earlier. To achieve this, the authors exploit the fact that the input image is already photo-realistic, and seek an image transform that is locally affine in color space. Formally, they build upon the Matting Laplacian of Levin et al. [4], which shows how to express a grayscale matte as a locally affine combination of the input RGB channels. The equation for the photo-realistic regularization is shown below.
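
Following [2], the photo-realistic regularization is

    L_m = \sum_{c=1}^{3} V_c[O]^T \, M_I \, V_c[O]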

Here M_I is the Matting Laplacian matrix, which depends only on the input image I. It has shape N × N, where N is the number of pixels in the input image. V_c[O] is the vectorized c-th color channel of the output image, with shape N × 1. This regularization term penalizes outputs that are not well explained by a locally affine transformation.
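
As a sketch, assuming M_I has already been built for the content image as a sparse N x N matrix (for example with an off-the-shelf implementation of Levin et al.'s closed-form matting), the penalty can be evaluated per color channel roughly as:

    import numpy as np

    def photorealism_penalty(output_img, matting_laplacian):
        # output_img: (H, W, 3) array in [0, 1]; matting_laplacian: sparse (N, N) matrix M_I
        n = output_img.shape[0] * output_img.shape[1]
        penalty = 0.0
        for c in range(3):
            v_c = output_img[..., c].reshape(n)          # vectorized channel V_c[O]
            penalty += v_c @ (matting_laplacian @ v_c)   # V_c[O]^T M_I V_c[O]
        return penalty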

Now the total loss becomes a weighted combination of the content loss, the augmented style loss and the photo-realistic regularization. The equation is shown below.
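
Following [2], with per-layer weights \alpha_l and \beta_l and global weights \Gamma and \lambda, this is

    L_{total} = \sum_{l=1}^{L} \alpha_l \, L_c^l + \Gamma \sum_{l=1}^{L} \beta_l \, L_{s+}^l + \lambda \, L_m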

Implementation Details

For both the style and content images, I resize the shorter side to 512 pixels and then apply a center crop so that the images have shape 512×512.
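
A minimal sketch of this preprocessing using torchvision transforms:

    from PIL import Image
    from torchvision import transforms

    # Resize so the shorter side is 512 px, then center-crop to 512 x 512.
    preprocess = transforms.Compose([
        transforms.Resize(512),
        transforms.CenterCrop(512),
        transforms.ToTensor(),
    ])

    def load_image(path):
        return preprocess(Image.open(path).convert("RGB"))  # tensor of shape (3, 512, 512)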

Semantic Segmentation

I used a dilated ResNet-50 pretrained on the ADE20K dataset as the semantic segmentation model. The model is obtained from https://github.com/CSAILVision/semantic-segmentation-pytorch. The model predicts 150 classes, and I merged some of them to make the results look cleaner (a sketch of applying the merging follows the list). The merged classes are:

  • [water, sea, river, lake] -> water
  • [plant, grass, tree, fence] -> tree
  • [house, building, skyscraper, wall, hovel, tower] -> building
  • [road, path, sidewalk, sand, hill, earth, field, land, grandstand, stage, dirt track] -> road
  • [awning, ceiling] -> roof
  • [mountain, rock] -> mountain
  • [stairway, stairs, step] -> stair
  • [chair, seat, armchair, sofa, bench, swivel chair] -> chair
  • [car, bus, truck, van] -> vehicle
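
A sketch of how such a merging table can be applied to the predicted label map; the helper below is illustrative, and ade20k_names / merged_names are assumed to be ordered lists of class names:

    import numpy as np

    # Merging table: original ADE20K class name -> merged class name (subset shown).
    MERGE = {
        "water": "water", "sea": "water", "river": "water", "lake": "water",
        "plant": "tree", "grass": "tree", "tree": "tree", "fence": "tree",
        "house": "building", "building": "building", "skyscraper": "building",
        # ... remaining groups from the list above
    }

    def merge_labels(pred, ade20k_names, merged_names):
        # pred: (H, W) array of ADE20K class indices
        out = np.full_like(pred, -1, dtype=np.int64)   # -1 marks classes left unmerged
        for idx, name in enumerate(ade20k_names):
            merged = MERGE.get(name)
            if merged is not None:
                out[pred == idx] = merged_names.index(merged)
        return out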

An example of the semantic segmentation result after merging can be seen in the figure below.

Semantic Segmentation Results

We can see that in the example above, the style image and the content image contain some different semantic classes (e.g. mountain and chair). I handled this by only using the mask classes that exist in both images. I also made some manual corrections to the predicted segmentation for each image. The semantic masks are down-sampled by average pooling and bilinear interpolation to match the size of the feature maps at different layers.
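
A minimal sketch of this down-sampling, assuming the masks are stored as a (C, H, W) tensor with one channel per merged class:

    import torch.nn.functional as F

    def downsample_masks(masks, target_hw):
        # masks: (C, H, W) binary class masks; target_hw: (h, w) of a VGG feature map
        m = masks.unsqueeze(0).float()                 # (1, C, H, W)
        stride = masks.shape[-1] // target_hw[-1]
        if stride > 1:
            m = F.avg_pool2d(m, kernel_size=stride)    # coarse average pooling
        m = F.interpolate(m, size=target_hw, mode="bilinear", align_corners=False)
        return m.squeeze(0)                            # (C, h, w) soft masks in [0, 1]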

Style Transfer

For the style transfer optimization, I followed the original paper and adopted a two-stage optimization approach: first optimizing without the photo-realistic regularization, then using the output of the first stage as initialization for the full optimization. I also found that the weights mentioned in the paper did not work well, and I ended up using a content loss weight of 100, a style loss weight of 10^6, and a photo-realistic regularization weight of 1.
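
A rough sketch of this two-stage schedule with L-BFGS; content_loss_total, aug_style_loss_total and photorealism_reg are placeholder helpers that sum the per-layer terms, and content_img is the preprocessed content image:

    import torch

    def run_stage(init_image, loss_fn, num_steps=300):
        # init_image: (3, H, W) tensor used as initialization; loss_fn: image -> scalar loss
        image = init_image.clone().requires_grad_(True)
        opt = torch.optim.LBFGS([image], max_iter=num_steps)

        def closure():
            opt.zero_grad()
            loss = loss_fn(image)
            loss.backward()
            return loss

        opt.step(closure)
        return image.detach()

    # Stage 1: content + augmented style only (weights 100 and 1e6).
    loss_stage1 = lambda x: 100 * content_loss_total(x) + 1e6 * aug_style_loss_total(x)
    stage1 = run_stage(content_img, loss_stage1)

    # Stage 2: add the photo-realistic regularization (weight 1), initialized from stage 1.
    loss_stage2 = lambda x: loss_stage1(x) + 1.0 * photorealism_reg(x)
    stage2 = run_stage(stage1, loss_stage2)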

Experiments

Based on my reimplementation, I did several experiments to verify the effectiveness of the proposed augmented style loss and the photo-realistic regularization, and also tried transferring style between semantically different objects.

Effect of Augmented Style Loss

I ran my reimplementation with both the original style loss and the augmented style loss, and the qualitative results are shown below.

We can see that the neural style transfer approach has obvious non-photo-realistic artifacts in the sky of the generated images: there are grass-like textures in the first row and lighting textures in the second row. This is because the neural style transfer method has no semantic context during style transfer: it does not know the object correspondences between the style image and the content image. From the images above, we can see that the augmented style loss solves this issue decently well and those weird textures disappear. I suspect that the red/green-ish color in the top-right corner of the image generated with the augmented style loss (first row) is caused by imperfect segmentation masks: some pixels on the mountain may have leaked into the sky segmentation mask. Still, even with the augmented style loss, we can see the window distortions on the buildings in the second-row images.

Effect of Photo-realistic Regularization

I also did experiments with different amounts of photo-realistic regularization, and the results are shown below.

We can see that even with a very small regularization weight (0.1), the distortions are reduced a lot, though some small distortions still remain. With a regularization weight of 1, the generated image is decently photo-realistic. However, as the regularization weight grows larger, the image becomes greyish and the sky no longer looks realistic. When the regularization weight reaches 1000, the optimization becomes very unstable and the generated image looks quite bad. Thus I used a regularization weight of 1 for all my following experiments.

More Qualitative Results

Below are more generated images using my reimplementation of the paper. We can see that the generated results are not perfect, e.g. the sky and reflection in row 3, and the wall in row 4. Note that I used the same hyper-parameters for all the generations below, and better results might be obtained by tuning the hyper-parameters for each image pair.

Style Transfer between semantically different objects

The augmented style loss allows us to transfer the style between objects of the same category by using semantic masks, but what happens if we use masks to transfer the style between semantically different objects? To experiment with this, I used the VGG Image Annotator [3] to manually create masks for images downloaded from the web, and then applied my reimplementation to those images. Below are some generated images.

We can see that the algorithm is able to transfer the texture of the style image to the content image even between semantically different objects. The results look better when transferring onto texture-less objects, whether the style object is texture-rich or texture-less (first two rows). In the opposite direction, the original content image’s texture makes the generated image less visually satisfying (third row).

Conclusion

In this project, I reimplemented the photo-realistic style transfer paper [2] and did experiments to verify the effectiveness of the augmented style loss and the photo-realistic regularization. I ran my implementation on several content-style image pairs and got decent qualitative results. I also investigated what happens when we use manually created masks to transfer styles between semantically different objects, and the qualitative results showed the algorithm’s capability for this task.

References

[1] Gatys, Leon A., Alexander S. Ecker, and Matthias Bethge. “A neural algorithm of artistic style.” arXiv preprint arXiv:1508.06576 (2015).

[2] Luan, Fujun, et al. “Deep photo style transfer.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.

[3] VGG Image Annotator (VIA). https://www.robots.ox.ac.uk/~vgg/software/via/

[4] Levin, Anat, Dani Lischinski, and Yair Weiss. “A closed-form solution to natural image matting.” IEEE Transactions on Pattern Analysis and Machine Intelligence (2008).