16726 Final Project: Multi-Modal Instruction Image Editing

Tiancheng Zhao (andrewid: tianchen), Chia-Chun Hsieh (andrewid: chiachun)

In this project, we explore the use of multi-modal instructions to edit images.
Object-deforming / structural-change image editing is under-explored in the current diffusion-based image-editing literature (see Figure 1). We hypothesize that multi-modal instructions are a natural way to address this, because we can use

  1. sketches/masks to designate object boundaries
  2. text to designate object attributes

Figure 1: Motivating Example

We experimented with 3 different models:

  1. Instruct-Pix2Pix (text-only baseline)
  2. Stable-Diffusion (img2img translation with text prompt)
  3. Pix2Pix-Zero (with masks)
We show the results for each model in the following sections.

Experiment 1: Instruct Pix2Pix

Method Overview

Instruct Pix2Pix is a fine-tuned text-guided image editing model that takes an input image and an editing instruction to produce the desired image.

To verify that text-only editing instructions perform poorly for our purposes (structural changes), we first explored different input images, prompts, diffusion steps, and guidance weights with Instruct-Pix2Pix.
Indeed, we observe that this method performs well for attribute changes but produces many artifacts across all of our samples when asked to make structural changes. We attribute this to the training process of Instruct-Pix2Pix, which uses Prompt-to-Prompt to generate its training data, and Prompt-to-Prompt itself struggles with structure-changing edits.
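
For reference, the following is a minimal sketch of how Instruct-Pix2Pix can be run, assuming the public Hugging Face diffusers implementation; the input file name is a placeholder, and the three arguments correspond to the inference steps, image guidance weight, and text guidance weight swept below.

```python
# Minimal sketch of an Instruct-Pix2Pix edit via diffusers (illustrative;
# "samoyed.jpg" is a placeholder for one of our test images).
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("samoyed.jpg").convert("RGB").resize((512, 512))

edited = pipe(
    "Make the dog Brown",        # editing instruction
    image=image,                 # input image
    num_inference_steps=30,      # swept in 1.1.1 / 1.2.1
    image_guidance_scale=1.5,    # fixed to 1.5 in our experiments
    guidance_scale=15,           # text guidance weight, swept in 1.1.2 / 1.2.2
).images[0]
edited.save("samoyed_brown.png")
```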

1.1 Attribute-Change Results

1.1.1 Number of Inference Steps

For this experiment, we fixed the image guidance weight to 1.5 and text guidance weight to 15.

Input Image | Prompt                      | Step 10 | Step 30 | Step 50
Samoyed     | "Make the dog Brown"        | (image) | (image) | (image)
Chihuahua   | "Make the background snowy" | (image) | (image) | (image)
Chihuahua   | "Make the chihuahua red"    | (image) | (image) | (image)

1.1.2 Text Guidance Weights

For this experiment, we fixed the image guidance weight to 1.5, and number of inference steps to 30.

Input Image | Prompt                      | Weight 1.5 | Weight 15 | Weight 30
Samoyed     | "Make the dog Brown"        | (image)    | (image)   | (image)
Samoyed     | "Make the background snowy" | (image)    | (image)   | (image)
Chihuahua   | "Make the chihuahua red"    | (image)    | (image)   | (image)

1.2 Structural Change Results

1.2.1 Number of Inference Steps

For this experiment, we fixed the image guidance weight to 1.5 and text guidance weight to 15.

Input Image | Prompt                               | Step 10 | Step 30 | Step 50
Samoyed     | "Make the Samoyed jump"              | (image) | (image) | (image)
Samoyed     | "Make the Samoyed lift its left leg" | (image) | (image) | (image)
Chihuahua   | "Give the chihuahua wings"           | (image) | (image) | (image)

1.2.2 Text Guidance Weights

For this experiment, we fixed the image guidance weight to 1.5, and number of inference steps to 30.

Input Image | Prompt                               | Weight 1.5 | Weight 15 | Weight 30
Samoyed     | "Make the Samoyed jump"              | (image)    | (image)   | (image)
Samoyed     | "Make the Samoyed lift its left leg" | (image)    | (image)   | (image)
Chihuahua   | "Give the chihuahua wings"           | (image)    | (image)   | (image)

Experiment 2: Stable-Diffusion with sketches+text

Method Overview

We use a Stable-Diffusion setup similar to homework 5. The intuition for our approach is to use erasing and addition of objects (via strokes) to achieve structural changes, then use Stable Diffusion to smooth out the artifacts. We first add noise with DDPM for a certain number of steps and then denoise with DDIM to obtain the edited images.
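
A rough sketch of this procedure is shown below, using the diffusers img2img pipeline as a stand-in for our homework-5 code (the model name, file names, and `strength` value are illustrative): the pipeline noises the latent of the stroke-edited image for a number of steps determined by `strength` and then denoises it with the configured DDIM scheduler.

```python
# Sketch of the noise-then-denoise step on a stroke-edited input (illustrative;
# our actual implementation follows the homework 5 codebase).
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline, DDIMScheduler

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)  # denoise with DDIM

# The input already contains our manual edits: erased regions plus rough
# strokes sketching the new structure (e.g. a lifted leg).
sketch = Image.open("samoyed_strokes.png").convert("RGB").resize((512, 512))

out = pipe(
    prompt="A samoyed lifts its leg",
    image=sketch,
    strength=0.6,               # fraction of the schedule used to add noise
    num_inference_steps=50,     # 0.6 * 50 ~ 30 denoising steps
    guidance_scale=15,          # text guidance weight
).images[0]
out.save("samoyed_lifts_leg.png")
```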

We again conducted experiments with different input images, prompts, diffusion steps, and guidance weights, and observed that, while this approach produces better results than Instruct-Pix2Pix, we were unable to isolate the editing to only the object of interest.

2.1 Number of Inference Steps

For this experiment, we fixed the text guidance weight to 15.

We notice that the number of inference steps and the random seed have a large influence on the quality of the generated images. Moreover, the best number of inference steps is specific to each image and text prompt (35 for the jumping samoyed and 30 for the samoyed lifting its leg).

Input Image | Prompt                    | Seed   | Step 30 | Step 35 | Step 40
Samoyed     | "A jumping samoyed"       | Seed 1 | (image) | (image) | (image)
Samoyed     | "A jumping samoyed"       | Seed 2 | (image) | (image) | (image)
Samoyed     | "A samoyed lifts its leg" | Seed 1 | (image) | (image) | (image)
Samoyed     | "A samoyed lifts its leg" | Seed 2 | (image) | (image) | (image)

2.2 Text Guidance Weights

For this experiment, we fixed the number of inference steps to 35 for the jumping samoyed and 30 for the samoyed lifting its leg.

We notice that the text guidance weight has little influence on the quality of the generated images.

Input Image | Prompt                    | Seed   | Weight 5 | Weight 10 | Weight 15
Samoyed     | "A jumping samoyed"       | Seed 1 | (image)  | (image)   | (image)
Samoyed     | "A jumping samoyed"       | Seed 2 | (image)  | (image)   | (image)
Samoyed     | "A samoyed lifts its leg" | Seed 1 | (image)  | (image)   | (image)
Samoyed     | "A samoyed lifts its leg" | Seed 2 | (image)  | (image)   | (image)

2.3 Discussion

This method can also be applied to other diffusion models, such as Versatile Diffusion, another publicly available LDM-based diffusion model. The results are similar, so we do not include them here.

We also observe that the artifacts introduced by the stroke-based editing cannot be removed when the number of inference steps is less than 30. We believe this is due to the nature of the LDM architecture, which diffuses in the latent space rather than the pixel space. As in SDEdit, if the number of inference steps is too large, the generated images drift far from the input image; if it is too small, artifacts remain. Unfortunately, we could not validate this due to the lack of publicly available pixel-space text-to-image diffusion models.

Experiment 3: Pix2Pix-Zero

Method Overview

To edit only the intended parts of the image and minimize side effects, we attempted to use Pix2Pix-Zero with some attention-map manipulations. The intuition is that, since cross-attention maps determine the structure of the generated image, we can apply masks to the cross-attention maps corresponding to the target object.

Pix2Pix-Zero uses BLIP to generate a text prompt for the input image. We use CLIP to find the corresponding cross-attention map, by finding the caption word with the highest similarity to the target object. We then apply masks to constrain the region to be modified, resizing them to match each level of the cross-attention maps.
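
A minimal sketch of this word-matching step is shown below, using CLIP text features from the transformers library; the caption and target word are illustrative, and Pix2Pix-Zero's own code may organize this differently.

```python
# Pick the caption word whose CLIP text feature is closest to the target object.
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

caption = "a white dog standing on top of a lush green field"  # BLIP caption
target = "samoyed"                                              # object to edit

words = caption.split()
with torch.no_grad():
    inputs = tokenizer(words + [target], padding=True, return_tensors="pt")
    feats = model.get_text_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)

sims = feats[:-1] @ feats[-1]        # cosine similarity of each word vs. target
print(words[int(sims.argmax())])     # expected: "dog" -> mask its attention maps
```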

Since Pix2Pix-Zero uses deterministic DDIM inversion to add noise and then denoises with DDIM, it better preserves the structure of the image and the identity of the object than SDEdit, which adds noise with DDPM and denoises with DDIM.
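
For reference, a single DDIM inversion step (the deterministic "add noise" direction) might look like the following; this is a self-contained sketch of the update rule, not Pix2Pix-Zero's exact code.

```python
import math
import torch

def ddim_inversion_step(x_t: torch.Tensor, eps: torch.Tensor,
                        alpha_bar_t: float, alpha_bar_next: float) -> torch.Tensor:
    """One deterministic step from timestep t toward more noise.

    x_t: current latent; eps: UNet noise prediction at t;
    alpha_bar_*: cumulative alphas at t and at the next (noisier) timestep.
    """
    # Predict the clean latent x_0, then re-noise it to the next timestep.
    x0_pred = (x_t - math.sqrt(1 - alpha_bar_t) * eps) / math.sqrt(alpha_bar_t)
    return math.sqrt(alpha_bar_next) * x0_pred + math.sqrt(1 - alpha_bar_next) * eps
```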

3.1 Erase some parts of the object

To erase parts of the target object, we mask out (set to 0) the corresponding region in the target object's cross-attention maps.

Naively masking out the region can lead to a pattern mismatch between the masked region and the rest of the image.

To achieve better quality, we can also specify which object should fill in the masked-out region, by setting that object's cross-attention maps to 1 in the region. We display the text prompt generated by BLIP so that we can precisely specify the fill-in object.
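
A minimal sketch of this cross-attention edit is given below, assuming the per-token maps at one resolution are available as a tensor of shape (heads, H*W, tokens); real pipelines store several resolutions, so the same edit would be applied to each after resizing the mask. The function and variable names are ours, not Pix2Pix-Zero's.

```python
from typing import Optional
import torch
import torch.nn.functional as F

def edit_cross_attention(attn_maps: torch.Tensor, mask: torch.Tensor,
                         target_idx: int, fill_idx: Optional[int] = None) -> torch.Tensor:
    """attn_maps: (heads, H*W, tokens); mask: (H_img, W_img) with 1 = region to edit."""
    heads, hw, tokens = attn_maps.shape
    res = int(hw ** 0.5)
    # Resize the user mask to this attention resolution and flatten it.
    m = F.interpolate(mask[None, None].float(), size=(res, res), mode="nearest")
    m = m.reshape(1, hw)
    edited = attn_maps.clone()
    # Erase: zero the target token's map inside the masked region.
    edited[:, :, target_idx] = edited[:, :, target_idx] * (1 - m)
    if fill_idx is not None:
        # "Better" variant: force the fill-in token (e.g. "field") to 1 in that region.
        edited[:, :, fill_idx] = edited[:, :, fill_idx] * (1 - m) + m
    return edited
```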

We observe the following:

  1. Specifying the fill-in object performs significantly better when the cross-attention guidance and the number of inference steps are small. It also mitigates the pattern-mismatch issue when the cross-attention guidance and the number of inference steps are large.
  2. When the cross-attention guidance is small, increasing the number of inference steps improves the edit (no vanishing back legs). When the cross-attention guidance is large, increasing the number of inference steps impairs the edit (face distortion and additional ears).
  3. When the number of inference steps is small, increasing the cross-attention guidance improves the edit (no vanishing back legs). When the number of inference steps is large, increasing the cross-attention guidance impairs the edit (face distortion and additional ears).
  4. The best result is achieved with cross-attention guidance = 0.15 and 30 inference steps.

Input image: Samoyed. Input mask: (image). BLIP-generated prompt: "a white dog standing on top of a lush green field".

Cross-attention Guidance | Variant                           | Step 30 | Step 40 | Step 50
0.05                     | Naive                             | (image) | (image) | (image)
0.05                     | Better (fill-in object specified) | (image) | (image) | (image)
0.1                      | Naive                             | (image) | (image) | (image)
0.1                      | Better (fill-in object specified) | (image) | (image) | (image)
0.15                     | Naive                             | (image) | (image) | (image)
0.15                     | Better (fill-in object specified) | (image) | (image) | (image)

3.2 Add some parts of the object

To add parts to the target object, we first mask out (set to 0) the corresponding region in the original object's cross-attention maps, since the original object has high activation there; without masking it out, the image hardly changes.

Then, we set the target object's cross-attention maps to 1 in that region. Again, we display the text prompt generated by BLIP so that we can precisely specify which original object to mask out.
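
Reusing the helper sketched in 3.1, the "add" case swaps the roles of the two tokens: inside the mask we zero the original object's map and set the added part's map to 1. The token indices, mask, and attention tensor below are placeholders for illustration.

```python
import torch

leg_mask = torch.zeros(512, 512)
leg_mask[300:420, 60:180] = 1            # region where the new leg should appear
attn_maps = torch.rand(8, 32 * 32, 77)   # (heads, H*W, tokens) at one resolution

edited = edit_cross_attention(attn_maps, leg_mask,
                              target_idx=3,   # hypothetical index of "dog" (suppress here)
                              fill_idx=9)     # hypothetical index of the part to add
```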

We observe the following:

  1. Increasing the number of inference steps impairs the edit (face distortion, additional ears, vanishing added legs).
  2. When the number of inference steps is small, the cross-attention guidance has little influence on the edit.
  3. The best result is achieved with 30 inference steps.

Input image: Samoyed. Input mask: (image).

Cross-attention Guidance | Step 30 | Step 40 | Step 50
0.05                     | (image) | (image) | (image)
0.1                      | (image) | (image) | (image)
0.15                     | (image) | (image) | (image)

However, not all masks work equally well: as shown in the table below, our method fails to add a leg horizontal to the ground. We believe the underlying reason is that the verb "standing" in the text prompt constrains the editing process, so the dog must be standing on top of a lush green field.

We have similar observations to the previous part, except that increasing either the cross-attention guidance or the number of inference steps impairs the edit.

Input image: Samoyed. Input mask: (image, leg horizontal to the ground).

Cross-attention Guidance | Step 30 | Step 40 | Step 50
0.05                     | (image) | (image) | (image)
0.1                      | (image) | (image) | (image)
0.15                     | (image) | (image) | (image)

3.3 Cross-object editing

We also tried cross-object editing: removing or adding a leg while simultaneously transferring the dog into a cat. The results are reasonable, though not perfect.

Input image: Samoyed.

Edit       | Results
Remove leg | (image) (image)
Add leg    | (image) (image)

Conclusion

We successfully achieve some basic structural changes with our second and third approaches, but limitations remain. The second approach modifies unintended regions of the image, and the third is constrained by the text prompts generated by BLIP. Moreover, since we do not fine-tune the pre-trained diffusion models, our approaches inherit their biases. We leave more flexible and fine-tuning-based approaches to future work.