16-726 SP22 Assignment #4 - Neural Style Transfer

Run

Bells and whistles using my previous pictures.

Introduction

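This assignment covers neural style transfer. We start by defining two losses, content loss and style loss, which measure how far the feature maps of the generated image at a given layer are from those of the content image and the style image. Content loss compares the feature maps directly, while style loss compares their Gram matrices. We insert these losses after selected layers of a pretrained VGG19 network.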
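
As a rough illustration of the two losses, the sketch below computes them on tiny feature maps in plain Python. It assumes a feature map is stored as a list of channels, each a flat list of spatial activations, and that the Gram matrix is normalized by the number of elements. The function names and this normalization are assumptions for illustration, not the assignment's actual PyTorch code.

```python
def content_loss(features, target):
    """Mean squared error between two feature maps of identical shape."""
    diffs = [
        (x - y) ** 2
        for row, target_row in zip(features, target)
        for x, y in zip(row, target_row)
    ]
    return sum(diffs) / len(diffs)


def gram_matrix(features):
    """Channel-by-channel inner products, normalized by element count (assumed)."""
    channels = len(features)
    positions = len(features[0])
    norm = channels * positions
    return [
        [sum(a * b for a, b in zip(features[i], features[j])) / norm
         for j in range(channels)]
        for i in range(channels)
    ]


def style_loss(features, target):
    """Mean squared error between the Gram matrices of two feature maps."""
    return content_loss(gram_matrix(features), gram_matrix(target))


# Two channels, two spatial positions; `shuffled` swaps the two positions.
original = [[1.0, 2.0], [3.0, 4.0]]
shuffled = [[2.0, 1.0], [4.0, 3.0]]
```

Here `content_loss(original, shuffled)` is nonzero while `style_loss(original, shuffled)` is zero, because the Gram matrix keeps only channel correlations and discards spatial arrangement.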

Experiments

Part 1: Content Loss

I insert the content loss layer after conv_1, conv_3, conv_5, and conv_7 in separate experiments and compare the results. The deeper the layer at which the content loss is inserted, the worse the reconstruction becomes.

Image | Conv_1 | Conv_3 | Conv_5 | Conv_7
Wally | conv1_2 | conv3_2 | conv5_2 | conv7_2

Reconstructions from two random noise images with content loss (inserted after conv_4 by default):

Noise | Result
Noise | Noise_result
Noise2 | Noise_result2

The two results differ: some details on the dog's face retain noise from the input image, probably because the two inputs have different noise distributions.
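
To make "different noise distributions" concrete, here is a minimal sketch of two ways a random starting image could be generated. The report does not say which distributions were used, so the uniform and clipped Gaussian choices, the function names, and the 64 x 64 grayscale size are all assumptions for illustration.

```python
import random


def uniform_noise(height, width, seed):
    """Grayscale image with pixels drawn uniformly from [0, 1)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(width)] for _ in range(height)]


def gaussian_noise(height, width, seed, mean=0.5, std=0.2):
    """Grayscale image with Gaussian pixels clipped to [0, 1]."""
    rng = random.Random(seed)
    return [
        [min(1.0, max(0.0, rng.gauss(mean, std))) for _ in range(width)]
        for _ in range(height)
    ]


def pixel_mean(image):
    """Average pixel value, a crude summary of the noise distribution."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)


# Two starting images with different seeds and different distributions.
noise_a = uniform_noise(64, 64, seed=0)
noise_b = gaussian_noise(64, 64, seed=1)
```

Fixing the seed makes each starting image reproducible, which helps when comparing how much of the final difference comes from the initialization.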

Part 2: Style Loss

I insert the style loss layers after conv_1-3, conv_4-6, and conv_7-8 in separate experiments and compare the results. With shallow layers, the colors and large color blocks of the style image are preserved. With middle layers, the blended texture is preserved. With even deeper layers, unexpected noise appears.

the_scream

Style | Conv_1-3 | Conv_4-6 | Conv_7-8
The_scream | conv1-3 | conv4-6 | conv7-8

Textures synthesized from two random noise images with only style loss (inserted after conv_1 to conv_5 by default):

Noise | Result
noise | noise_result
noise2 | noise_result2

As you can see, different noise distributions affect the brushstrokes in the generated texture. Below is the top-left part (from (0, 0) to (100, 100) px) of each generated texture.

Comparison
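
The top-left patch used for this comparison can be extracted with a simple crop. The sketch below assumes the image is a list of pixel rows; the `crop` helper and the 512 x 512 stand-in texture are hypothetical.

```python
def crop(image, top, left, height, width):
    """Return the height x width patch whose top-left pixel is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


# Hypothetical 512 x 512 texture whose pixel value encodes its (row, column).
texture = [[(r, c) for c in range(512)] for r in range(512)]
patch = crop(texture, top=0, left=0, height=100, width=100)
```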

Part 3: Style Transfer

Hyper-parameters

One intuitive method is to adjust the input image size. I tried 1024 instead of the default 512, and the result is more interesting: the strokes are more delicate and there are no large color blocks. Because the dog picture is realistic, its stylized result can look odd, so I use the ballet dancer as an alternative to show the difference. One interesting finding is that the higher the input resolution, the more organized the style becomes. At lower resolutions, the strokes in the final result look wild and fantastical.

hyper_input1024

hyper_input1024_2
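
For reference, here is a small sketch of how a size setting such as 512 or 1024 could translate into image dimensions. It assumes the setting refers to the shorter edge and that the aspect ratio is preserved; the report does not confirm this, and the helper name and the 800 x 600 source image are hypothetical.

```python
def resized_shape(width, height, size):
    """Scale (width, height) so the shorter edge equals `size`, keeping aspect ratio."""
    scale = size / min(width, height)
    return round(width * scale), round(height * scale)


# Hypothetical 800 x 600 source image at the default and the larger setting.
small = resized_shape(800, 600, 512)
large = resized_shape(800, 600, 1024)
```

Doubling the size roughly quadruples the number of pixels, so each optimization step processes about four times as much data.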

A second approach is to adjust the ratio of style weight to content weight. I changed it from the default 1000000:1 to 800000:1. To my eye, the result looks more striking and artistic.

weight

For the following parts, I use an adjusted weight ratio of 900000:1 to produce my results.
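
The weight ratio enters the objective as a weighted sum of the two kinds of loss. The sketch below shows that combination under the assumption that per-layer losses are simply summed; the function name and the sample loss values are hypothetical and chosen only to show the effect of the ratio.

```python
def total_loss(style_losses, content_losses, style_weight=1_000_000, content_weight=1):
    """Weighted sum of per-layer style and content losses."""
    return style_weight * sum(style_losses) + content_weight * sum(content_losses)


# Hypothetical per-layer loss values, not measurements from the experiments.
style_losses = [2e-6, 3e-6]
content_losses = [4.0]

default = total_loss(style_losses, content_losses)
adjusted = total_loss(style_losses, content_losses, style_weight=900_000)
```

Lowering the style weight shrinks the style term relative to the content term, so the optimizer gives up a little style fidelity to stay closer to the content image.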

Results generated from content

Style image \ Content image | Dancing | Wally
Frida_kahlo | 1 | 2
Picasso | 3 | 4

Results generated from noise

Style image \ Content image | Dancing | Wally
Frida_kahlo | 5 | 6
Picasso | 7 | 8

Generally speaking, results initialized from the content image have better quality than those initialized from noise, all else being equal.

I added timing code to the project. On Picasso + Dancing with 300 runs, generating from noise took 18.077855 s and generating from the content image took 17.449960 s. With 500 runs, generating from noise took 29.000032 s and generating from the content image took 28.763395 s. The difference in running time is small in both cases.
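
A minimal sketch of such timing code, together with the relative gap implied by the numbers above, might look like this. The `timed` and `relative_gap` helpers are hypothetical and are not the code actually added to the project.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def relative_gap(slower, faster):
    """Fractional extra time of the slower measurement over the faster one."""
    return (slower - faster) / faster


# Timings reported above for Picasso + Dancing, in seconds.
gap_300 = relative_gap(18.077855, 17.449960)  # noise vs. content, 300 runs
gap_500 = relative_gap(29.000032, 28.763395)  # noise vs. content, 500 runs
```

With the reported numbers, starting from noise is roughly 3.6% slower at 300 runs and under 1% slower at 500 runs. If the optimization runs on a GPU, the clock should be read after synchronizing (for example with `torch.cuda.synchronize()`), since CUDA kernels launch asynchronously.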

My favourite pictures

Style | Content | Result
little_hei2 | doge | doge_xiaohei2
Impression_sunrise | final-fantasy-xiv-endwalker-header | result3

Bells and Whistles

I also applied style transfer to my previous work (image blending).

Style | Content | Result
Impression Sunrise | blending | blending_result

Reference