Final Project - Text-to-Style Reconstruction for Diffusion Style Transfer

Name: Shun Tomita (ID: stomita)

Artistic style transfer synthesizes artwork-like images by combining a content image with the artistic style of another image. By leveraging CLIP (Contrastive Language-Image Pre-training) [Radford et al., 2021], previous works such as [Bai et al., 2023] and [Yang et al., 2023] demonstrated successful artistic style transfer with text guidance. However, current methodologies have several limitations. A method based on arbitrary style transfer and AdaIN [Bai et al., 2023] achieved text-based style transfer, but it can generate only one output per image and text prompt because no randomness is involved. CLIP-guided models such as DiffusionCLIP [Kim et al., 2022] and CLIPstyler [Kwon and Ye, 2022] use CLIP-based style losses, but they sometimes mistakenly transfer the content of the text guidance into the synthesized images. This project proposes diffusion-based text-guided style transfer using a style extraction network.

A novel contribution of this project is to apply the style extraction model proposed in ITstyler [Bai et al., 2023] to diffusion-based style transfer. The style extraction model encodes a style embedding from the CLIP encoding of the text input, which should reduce the unintended transfer of the text's content into the output.

An image embedding produced by the CLIP image encoder can be viewed as containing both a style part and a content part in the CLIP embedding space. Similarly, a text embedding obtained from the CLIP text encoder should contain a style part. Under this assumption, ITstyler [Bai et al., 2023] uses an MLP to map from the CLIP embedding space to a style space defined by a pre-trained VGG network [Simonyan and Zisserman, 2014]. In ITstyler, the mean and variance of intermediate VGG activations are taken as the style representation and used in an AdaIN-like architecture; in this project, I instead use the Gram matrix representation of style proposed in [Gatys et al., 2015], so that the style representation can be used as guidance in a diffusion model. I therefore trained a mapping network $\phi$ that maps a CLIP embedding to the Gram matrix representation of style using the following loss function:

$$\mathcal{L}_{map} = \big\| \phi\big(E_{CLIP}(I)\big) - \mathrm{Gram}(I) \big\|_2^2,$$

where $E_{CLIP}(I)$ is the CLIP image embedding of a training image $I$ and $\mathrm{Gram}(I)$ is its Gram-matrix style representation computed from VGG features. Through this loss, the mapping network learns to convert CLIP embeddings into style encodings. At inference time, text inputs are used in place of image inputs to obtain the Gram matrix representation. The figure below shows the architecture of the style encoding network and how it is trained.

[Figure: architecture and training of the style encoding (mapping) network]
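To make the training of the mapping network concrete, here is a minimal PyTorch sketch. It assumes precomputed CLIP image embeddings (512-dimensional here), takes the first convolutional layers of VGG-19 as style layers, and regresses full flattened Gram matrices with an MSE loss; the mapping network itself is reduced to two layers for brevity (the actual architecture is described at the end of this report).

```python
# Minimal sketch of training the mapping network phi: it regresses the Gram-matrix
# style representation of an image from its CLIP embedding (shapes and names are
# illustrative, not the actual implementation).
import torch
import torch.nn as nn
from torchvision.models import vgg19

vgg = vgg19(weights="IMAGENET1K_V1").features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

STYLE_LAYERS = [0, 2, 5, 7]  # conv1_1, conv1_2, conv2_1, conv2_2 (assumed style layers)

def gram_matrices(img):
    """Flattened, normalized Gram matrices of the selected VGG feature maps."""
    grams, x = [], img
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in STYLE_LAYERS:
            b, c, h, w = x.shape
            f = x.reshape(b, c, h * w)
            grams.append((f @ f.transpose(1, 2) / (c * h * w)).reshape(b, -1))
        if i >= max(STYLE_LAYERS):
            break
    return torch.cat(grams, dim=1)

clip_dim = 512                                             # CLIP embedding size (assumed)
out_dim = gram_matrices(torch.zeros(1, 3, 224, 224)).shape[1]
phi = nn.Sequential(nn.Linear(clip_dim, 2048), nn.SiLU(),  # simplified mapping network
                    nn.Linear(2048, out_dim))
opt = torch.optim.Adam(phi.parameters(), lr=1e-4)

def train_step(images, clip_embeddings):
    """images: (B,3,224,224), normalized for VGG; clip_embeddings: (B, clip_dim)."""
    target = gram_matrices(images)
    loss = nn.functional.mse_loss(phi(clip_embeddings), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```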

I used a pre-trained diffusion model, since [Yang et al., 2023] showed that a pre-trained model guided by both content and style losses can perform style transfer. DDPM [Ho et al., 2020] directly samples $x_t$ from the original image $x_0$ by adding Gaussian noise with $\beta_t \in (0, 1)$ at time $t \in \{1, \dots, T\}$ using the following formula:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon,$$

where $\varepsilon \sim \mathcal{N}(0, I)$, $\alpha_t = 1 - \beta_t$, and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$. The reverse process that generates a clean image is obtained by repeatedly applying the following backward diffusion step:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \varepsilon_\theta(x_t, t) \right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I),$$

where the network $\varepsilon_\theta(x_t, t)$ returns the predicted noise and $\sigma_t$ is the sampling variance. The total loss used for guidance is defined as follows:

$$\mathcal{L}_{total} = \lambda_{content}\, \mathcal{L}_{content} + \lambda_{style}\, \mathcal{L}_{style},$$

where $\mathcal{L}_{content}$ and $\mathcal{L}_{style}$ denote the content loss and the style loss, respectively, and $\lambda_{content}$, $\lambda_{style}$ are their weights. This loss is used as guidance during the sampling process to obtain the denoised image:

$$\hat{x}_{0,t} = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \varepsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}, \qquad x_{t-1} \leftarrow x_{t-1} - \lambda\, \nabla_{x_t}\, \mathcal{L}_{total}\big(\hat{x}_{0,t},\, x_0\big),$$

where $\hat{x}_{0,t}$ is the clean image predicted at step $t$ and $\lambda$ is a guidance scale.
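For concreteness, one guided reverse step could look roughly like the following PyTorch sketch. The guidance form (subtracting the gradient of the total loss from the ancestral sample) and the names `eps_model`, `total_loss_fn`, and `guidance_scale` are illustrative assumptions consistent with the equations above, not the exact implementation.

```python
# Sketch of one guided DDPM reverse sampling step. `eps_model` (the noise predictor),
# `betas`, and `total_loss_fn` are assumed to be defined elsewhere; the guidance form
# is an assumption consistent with the equations above.
import torch

def guided_reverse_step(eps_model, x_t, x_0, t, betas, total_loss_fn, guidance_scale):
    """x_t: current noisy sample; x_0: content image; t: integer timestep."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    x_t = x_t.detach().requires_grad_(True)
    eps = eps_model(x_t, t)

    # Predicted clean image \hat{x}_{0,t}, used by the content and style losses.
    x0_hat = (x_t - torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alpha_bar[t])

    # Gradient of the weighted total loss w.r.t. x_t (the guidance term).
    loss = total_loss_fn(x0_hat, x_0)
    grad = torch.autograd.grad(loss, x_t)[0]

    # Standard DDPM ancestral step, shifted by the guidance gradient.
    mean = (x_t - (1 - alphas[t]) / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return (mean + torch.sqrt(betas[t]) * noise - guidance_scale * grad).detach()
```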

The style loss is the L2 distance between the Gram matrices predicted by the mapping network from the text-style embedding and the VGG Gram matrices of the generated image:

$$\mathcal{L}_{style} = \big\| \mathrm{Gram}(\hat{x}_{0,t}) - \phi\big(E_{text}(p_{target})\big) \big\|_2^2,$$

where $p_{target}$ is the target prompt and $E_{text}$ denotes the CLIP text encoder.
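A minimal sketch of this style loss, assuming the mapping network `phi`, the `gram_matrices` helper from the earlier sketch, and a precomputed CLIP text embedding of the prompt:

```python
# Sketch of the style loss: L2 distance between the Gram matrices predicted from the
# prompt by the mapping network and the VGG Gram matrices of \hat{x}_{0,t}.
import torch.nn.functional as F

def style_loss(x0_hat, prompt_embedding, phi, gram_matrices):
    """x0_hat: predicted clean image (B,3,H,W); prompt_embedding: CLIP text embedding."""
    target_gram = phi(prompt_embedding)    # text -> style (Gram) representation
    image_gram = gram_matrices(x0_hat)     # VGG style representation of the image
    return F.mse_loss(image_gram, target_gram)
```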

The content loss consists of the patch-wise zero-shot contrastive (ZeCon) loss [Yang et al., 2023], a perceptual loss using VGG, and a pixel-level MSE loss. The patch-wise ZeCon loss [Yang et al., 2023] helps the model preserve semantic content during the denoising process. Intuitively, the ZeCon loss pulls together patches at the same location of $\hat{x}_{0,t}$ and $x_0$ and pushes apart patches at different locations. It is defined as follows:

$$\mathcal{L}_{ZeCon} = \sum_{\ell} -\log \frac{\exp(v_\ell \cdot v_\ell^{+} / \tau)}{\exp(v_\ell \cdot v_\ell^{+} / \tau) + \sum_{m \neq \ell} \exp(v_\ell \cdot v_m^{-} / \tau)},$$

where $v_\ell$ is the feature of the patch at location $\ell$ of $\hat{x}_{0,t}$, $v_\ell^{+}$ and $v_m^{-}$ are the features of the patches at the same and different locations of $x_0$, and $\tau$ is a temperature.

Each loss is weighted when used, and the weights are tuned as hyperparameters.
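The following sketch shows a generic patch-wise contrastive loss of this form together with the weighted combination; the choice of feature extractor and all names are illustrative assumptions rather than the exact implementation.

```python
# Sketch of a patch-wise contrastive (ZeCon-style) loss and the weighted combination.
# Patch features at the same spatial location of \hat{x}_{0,t} and x_0 are positives;
# features at other locations are negatives.
import torch
import torch.nn.functional as F

def zecon_loss(feat_gen, feat_src, tau=0.07):
    """feat_gen, feat_src: (B, C, H, W) feature maps of \hat{x}_{0,t} and x_0."""
    b, c, h, w = feat_gen.shape
    q = F.normalize(feat_gen.reshape(b, c, h * w).transpose(1, 2), dim=-1)  # (B, N, C)
    k = F.normalize(feat_src.reshape(b, c, h * w).transpose(1, 2), dim=-1)  # (B, N, C)
    logits = torch.bmm(q, k.transpose(1, 2)) / tau           # (B, N, N) patch similarities
    labels = torch.arange(h * w, device=logits.device).expand(b, -1)  # positives: diagonal
    return F.cross_entropy(logits.reshape(-1, h * w), labels.reshape(-1))

def total_loss(zecon, perceptual, mse, style, w_zecon, w_vgg, w_mse, w_style):
    """Weighted sum used as guidance; the weights are tuned as hyperparameters."""
    return w_zecon * zecon + w_vgg * perceptual + w_mse * mse + w_style * style
```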

To evaluate how well the mapping network extracts style representations from text, I first performed style reconstruction from text: images were optimized to minimize $\mathcal{L}_{style}$ given only text inputs (a sketch of this optimization follows the figure). The reconstruction results and the prompts used are shown below. The colors in the reconstructed images match the input prompts, but the images are blurry and noisy, probably because of the approximation error of the mapping network. The model also failed to capture texture, since the reconstructed images show no differences in texture.

[Figure: style reconstructions from text for the prompts "frilly, green, red, vein, fading color"; "beautiful and vibrant, like the fall colors"; "patches of orange, yellow, and red mixed"; "wonderful summer scene"]

The results of neural style transfer with content images and style texts (similar to HW4) are shown below. The style described by the text inputs is transferred successfully, but the generated images are still noisy because the Gram matrices produced by the mapping network are noisy.

[Figure: neural style transfer results. Captions: original; "blue background, yellowish"; "crystal, white, water color, green, grey"; "frilly, green, red, vein, fading color"; "beautiful and vibrant, like the fall colors"; "patches of orange, yellow, and red mixed"; "wonderful summer scene"]

As shown above, directly using the mapping network for style reconstruction does transfer style from text to image, but the resulting images are blurry and noisy. The next idea for using this text-based style representation is diffusion style transfer. The rationale is that the knowledge in a pre-trained diffusion model can be leveraged to synthesize less blurry and noisy images while transferring style through guidance.

The results of the diffusion style transfer are shown below. Compared to neural style transfer, it fails to transfer the styles given by the text inputs. The reason may be the difference in the optimization process: in neural style transfer the pixel values are optimized directly to minimize the style loss, whereas in the diffusion model the predicted noise is only guided to reduce it. This guidance may not be strong enough to transfer style, especially when the style is defined by Gram matrices.

[Figure: diffusion style transfer results. Captions: original; "blue background, yellowish"; "crystal, white, water color, green, grey"; "frilly, green, red, vein, fading color"; "beautiful and vibrant, like the fall colors"; "patches of orange, yellow, and red mixed"; "wonderful summer scene"]

The following images show the effect of the style loss weight. As the weight increases, the content of the original image tends to be lost and random patterns appear. If the weight is too high, numerical instability occurs: the gradient of the loss can no longer be computed and the model returns all-black images.

[Figure: effect of the style loss weight. Captions: weight_style = 10000; weight_style = 15000; weight_style = 20000]

Comparisons with the model using the CLIP loss [Yang et al., 2023] are shown below. Our results reproduce the target styles less faithfully; a possible reason is that our model failed to learn image texture and therefore cannot replicate the textures implied by the text inputs.

[Figure: comparison with the CLIP-loss model. Captions: ZeCon, 'cubism'; ZeCon, 'mosaic'; Ours, 'cubism'; Ours, 'mosaic']

The proposed method uses a mapping network to extract style features from CLIP encodings of text. While it can extract color information, the model failed to capture image texture, and the Gram-matrix guidance failed to transfer the extracted style embedding within the diffusion model.

Yunpeng Bai, Jiayue Liu, Chao Dong, and Chun Yuan. Itstyler: Image-optimized text-based style transfer. arXiv preprint arXiv:2301.10916, 2023.

Leon Gatys, Alexander S Ecker, and Matthias Bethge. Texture synthesis using convolutional neural networks. Advances in neural information processing systems, 28, 2015.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2426–2435, 2022.

Gihyun Kwon and Jong Chul Ye. Clipstyler: Image style transfer with a single text condition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18062–18071, 2022.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Serin Yang, Hyunmin Hwang, and Jong Chul Ye. Zero-shot contrastive loss for text-guided diffusion image style transfer. arXiv preprint arXiv:2303.08622, 2023.

The mapping network is a multi-layer perceptron with batch normalization. It has six linear layers, each followed by a batch normalization layer and a SiLU activation. The hidden layer sizes are [2048, 2048, 1024, 1024, 1024] and the output dimension is 20,672. Since the Gram matrices of the outputs of the first four convolutional layers of the VGG network are used as training targets, the network must return four Gram matrices. Because a Gram matrix is symmetric by construction, the model output is converted into symmetric matrices. The model is trained on the WikiArt dataset, which contains artworks in many different styles.
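For concreteness, a sketch of this architecture is shown below. The CLIP embedding dimension, whether the final layer also carries batch normalization and SiLU, and the exact symmetrization are assumptions; the output size of 20,672 is consistent with the upper-triangular entries of Gram matrices with 64, 64, 128, and 128 channels, matching the first four VGG convolutional layers.

```python
# Sketch of the mapping network described above (details beyond the text, such as the
# CLIP embedding size and whether the last layer keeps BN/SiLU, are assumptions).
import torch
import torch.nn as nn

GRAM_CHANNELS = [64, 64, 128, 128]                      # first four VGG conv layers
OUT_DIM = sum(c * (c + 1) // 2 for c in GRAM_CHANNELS)  # upper-triangular entries = 20,672

class MappingNetwork(nn.Module):
    def __init__(self, clip_dim=512):
        super().__init__()
        dims = [clip_dim, 2048, 2048, 1024, 1024, 1024, OUT_DIM]   # six linear layers
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out), nn.SiLU()]
        self.net = nn.Sequential(*layers)

    def forward(self, clip_emb):
        flat = self.net(clip_emb)
        grams, start = [], 0
        for c in GRAM_CHANNELS:
            n = c * (c + 1) // 2
            tri, start = flat[:, start:start + n], start + n
            g = flat.new_zeros(flat.shape[0], c, c)
            idx = torch.triu_indices(c, c)
            g[:, idx[0], idx[1]] = tri
            # Symmetrize: mirror the upper triangle into the lower triangle.
            g = g + g.transpose(1, 2) - torch.diag_embed(g.diagonal(dim1=1, dim2=2))
            grams.append(g)
        return grams
```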