For Bells & Whistles, I will first show results from interpolating in the GAN latent space. Next, I will show results from pre-trained image editing methods and compare them with our own implementations.
For the following experiments, we will try to reconstruct this target image:
| | prec_wgt=0.01 | prec_wgt=0.05 | prec_wgt=0.1 | prec_wgt=0.5 |
|---|---|---|---|---|
| l1=10, l2=0 | ![]() | ![]() | ![]() | ![]() |
| l1=0, l2=10 | ![]() | ![]() | ![]() | ![]() |
| l1=5, l2=5 | ![]() | ![]() | ![]() | ![]() |
For the perceptual loss, a weight that is too small has little effect on the reconstruction, while a weight that is too large leads to blurry images. I therefore choose a perceptual-loss weight of 0.1.
The runtime is similar across all combinations, usually taking 20-21 seconds to finish the optimization and produce the final output.
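For reference, here is a minimal sketch of the combined objective behind the sweep above, with `l1`/`l2`/`prec_wgt` matching the table. The names are illustrative (`perceptual_net` stands in for a VGG-style feature extractor); this is not the exact homework code.

```python
import torch.nn.functional as F

def projection_loss(generated, target, perceptual_net,
                    l1_wgt=10.0, l2_wgt=0.0, prec_wgt=0.1):
    """Weighted sum of L1, L2, and perceptual terms (a sketch, not the exact homework loss)."""
    loss = 0.0
    if l1_wgt > 0:
        loss = loss + l1_wgt * F.l1_loss(generated, target)
    if l2_wgt > 0:
        loss = loss + l2_wgt * F.mse_loss(generated, target)
    if prec_wgt > 0:
        # perceptual_net is assumed to return deep features (e.g., VGG activations)
        loss = loss + prec_wgt * F.mse_loss(perceptual_net(generated), perceptual_net(target))
    return loss
```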
Target Image | StyleGAN with z | StyleGAN with w | StyleGAN with w+ |
---|---|---|---|
![]() | ![]() | ![]() | ![]() |
The runtime is similar for each latent space, around 20 seconds for 1,000 optimization steps.
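As a rough sketch of how the three latent spaces differ, assuming a StyleGAN-style `mapping` network and `num_ws` synthesis layers (names are illustrative, not the homework API):

```python
import torch

def init_latent(mode, mapping, num_ws, dim=512, device="cuda"):
    """Initialize the variable to optimize, depending on the chosen latent space."""
    z = torch.randn(1, dim, device=device)
    if mode == "z":
        latent = z.clone()                            # optimize directly in z space
    elif mode == "w":
        latent = mapping(z).detach()                  # one w vector shared by all layers
    elif mode == "w+":
        w = mapping(z).detach()
        latent = w.unsqueeze(1).repeat(1, num_ws, 1)  # a separate w per synthesis layer
    else:
        raise ValueError(f"unknown latent space: {mode}")
    return latent.requires_grad_(True)
```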
Target Image | Vanilla | StyleGAN |
---|---|---|
![]() | ![]() | ![]() |
However, the vanilla architecture is faster to run, taking only about 7 seconds, whereas StyleGAN takes around 20-21 seconds.
Sketch | Mask | Output |
---|---|---|
![]() | ![]() | ![]() |
![]() | ![]() | ![]() |
![]() | ![]() | ![]() |
![]() | ![]() | ![]() |
![]() | ![]() | ![]() |
![]() | ![]() | ![]() |
![]() | ![]() | ![]() |
![]() | ![]() | ![]() |
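For the sketch-to-image results above, the drawing constraint can be expressed as a mask-weighted reconstruction term added to the latent-optimization objective. Below is a minimal sketch, assuming `sketch` and `mask` are tensors aligned with the generator output (illustrative, not the exact homework loss):

```python
import torch

def masked_sketch_loss(generated, sketch, mask):
    """Penalize differences only where the user actually drew (mask == 1)."""
    diff = (generated - sketch) * mask
    # Normalize by the number of constrained pixels so the loss scale
    # does not depend on how much of the canvas was sketched.
    return diff.abs().sum() / mask.sum().clamp(min=1.0)
```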
Input | t=500, strength=7.5 | t=750, strength=7.5 | t=500, strength=15 | t=750, strength=15 |
---|---|---|---|---|
![]() | ![]() | ![]() | ![]() | ![]() |
![]() | ![]() | ![]() | ![]() | ![]() |
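The t/strength sweep above follows the SDEdit recipe: noise the input up to timestep t, then denoise it back with classifier-free guidance at the given strength. Here is a minimal sketch under those assumptions; the `denoise_from` helper is hypothetical, while `scheduler.add_noise` follows the diffusers scheduler interface:

```python
import torch

def sdedit(image_latent, t, strength, scheduler, denoise_from):
    """SDEdit-style editing: partially noise the input, then denoise it back.

    t        -- how far into the forward process the input is pushed
                (larger t = noisier start = output follows the prompt more, the input less)
    strength -- classifier-free guidance scale used during denoising
    """
    noise = torch.randn_like(image_latent)
    noisy = scheduler.add_noise(image_latent, noise, torch.tensor([t]))
    # denoise_from is assumed to run the reverse process from timestep t
    # with classifier-free guidance scale `strength`.
    return denoise_from(noisy, start_t=t, guidance_scale=strength)
```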
Source Image | Interpolation | Target Image |
---|---|---|
![]() | ![]() | ![]() |
![]() | ![]() | ![]() |
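The interpolation results are obtained by linearly blending the two projected latents and decoding each blend. A minimal sketch, assuming both images have already been projected into the same latent space (`w_src`, `w_tgt`) of a `generator` with an illustrative call signature:

```python
import torch

@torch.no_grad()
def interpolate(generator, w_src, w_tgt, num_frames=8):
    """Decode a sequence of images along the straight line between two latents."""
    frames = []
    for alpha in torch.linspace(0.0, 1.0, num_frames):
        w = (1.0 - alpha) * w_src + alpha * w_tgt   # linear interpolation in latent space
        frames.append(generator(w))
    return frames
```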
InstructPix2Pix is one of the most recent works that can perform image editing end-to-end without any fine-tuning, so I run the InstructPix2Pix model on our image editing task and compare the results.
As the prompt for InstructPix2Pix, I use "Make it a royal painting.", as suggested by the InstructPix2Pix paper.
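For context, these runs can be reproduced with the Hugging Face diffusers InstructPix2Pix pipeline; the snippet below is a sketch of such a setup (the file paths and guidance settings are illustrative defaults, not the exact values used for the results here):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("input.png").convert("RGB")  # placeholder input path
edited = pipe(
    "Make it a royal painting.",
    image=image,
    num_inference_steps=50,
    image_guidance_scale=1.5,  # how strongly to stay close to the input image
    guidance_scale=7.5,        # how strongly to follow the text instruction
).images[0]
edited.save("instructpix2pix_output.png")
```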
Input | InstructPix2Pix Output | Our Output |
---|---|---|
![]() | ![]() | ![]() |
![]() | ![]() | ![]() |
For computation statistics, the inference time for InstructPix2Pix on an A4500 is 15-16 seconds on average, which is much faster than the runtime of our stable diffusion edits. One reason could be that the implementation in our homework differs from the Hugging Face stable-diffusion codebase; another could be the difference in sampling schemes, since DDPM takes many more sampling steps than DDIM.
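To illustrate the sampling-scheme point, swapping in a DDIM scheduler with roughly 50 steps instead of a full ~1,000-step DDPM chain accounts for much of this kind of speed gap; here is a sketch that reuses `pipe` and `image` from the snippet above (settings are illustrative):

```python
from diffusers import DDIMScheduler

# Replace the pipeline's default scheduler with DDIM and sample with ~50 steps
# instead of the ~1,000 steps a full DDPM chain would take.
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
edited = pipe(
    "Make it a royal painting.",
    image=image,
    num_inference_steps=50,
).images[0]
```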
Overall, the end-to-end models are easier to use, but they act as black boxes, which makes fine-grained control harder to exert. Moreover, these models can carry unknown biases from their training data, and they are more likely to fail when there is a gap between the inference domain and the training distribution. On the other hand, methods like SDEdit exert control based on individual inputs and let users adjust the level of constraint to their preference, but users have to do some prompt tuning and hyper-parameter tuning to get the best-performing results.