In this project, we investigate GAN inversion and guided image synthesis with StyleGAN and Stable Diffusion. With StyleGAN, we take scribbles as guide images and optimize the latent code under a combination of perceptual and L1 losses so that the generated output matches the guide. With Stable Diffusion, we initiate the denoising process from a noised version of the guide image as the starting latent.
Perceptual Loss Weight = 0 | Perceptual Loss Weight = 0.001 | Perceptual Loss Weight = 0.01 | Perceptual Loss Weight = 0.05 | Perceptual Loss Weight = 0.1 | Perceptual Loss Only |
---|---|---|---|---|---|
![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
In our experiments, we varied the weights of the L1 and perceptual losses, as well as the weight of an L2 regularizer on delta. The perceptual loss is the L2 distance between conv_5 features. All latents were optimized for 5000 iterations with L-BFGS.

Using only the L1 loss yields blurry images, since it averages over all pixels and penalizes local discrepancies weakly. Using only the perceptual loss yields slightly worse fine details (for example, in the upper-right corner of the images), since the convolutional features do not capture every pixel-level detail.

Increasing the perceptual loss weight changes the results only subtly; weights of 0.001 and 0.01 both produce satisfactory outcomes.
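To make the setup concrete, here is a minimal sketch of the optimization loop, assuming a hypothetical pretrained StyleGAN generator `G` that maps w+ latents to images in [-1, 1]; we read "conv_5" as the output of VGG-19's fifth convolutional block, which is an assumption about the exact layer used:

```python
import torch
import torchvision.models as models

# Feature extractor for the perceptual loss. Slicing at index 30 is our
# (assumed) reading of "conv_5"; adjust to match the layer actually used.
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features[:30].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def to_vgg_input(x):
    # Map generator output from [-1, 1] to VGG's ImageNet-normalized range.
    mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
    std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)
    return ((x + 1) / 2 - mean) / std

def inversion_loss(generated, guide, l1_weight=10.0, perc_weight=0.01):
    """Weighted sum of L1 loss and L2 distance between conv features."""
    l1 = (generated - guide).abs().mean()
    perc = (vgg(to_vgg_input(generated)) - vgg(to_vgg_input(guide))).pow(2).mean()
    return l1_weight * l1 + perc_weight * perc

def invert(G, guide, w_init, n_steps=5000):
    """Optimize a w+ latent with L-BFGS so that G(w) matches the guide."""
    w = w_init.clone().requires_grad_(True)
    opt = torch.optim.LBFGS([w], max_iter=n_steps, line_search_fn="strong_wolfe")

    def closure():
        opt.zero_grad()
        loss = inversion_loss(G(w), guide)
        loss.backward()
        return loss

    opt.step(closure)  # runs up to n_steps inner L-BFGS iterations
    return w.detach()
```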
L2 Regularization Weight = 0 | L2 Regularization Weight = 0.001 | L2 Regularization Weight = 0.01 | L2 Regularization Weight = 0.1 | L2 Regularization Weight = 1 |
---|---|---|---|---|
![]() | ![]() | ![]() | ![]() | ![]() |
We implemented L2 regularization by penalizing the L2 norm of delta, averaged across the batch and ws dimensions when using w+ latents. With strong regularization, e.g. a weight of 1, delta approaches zero and the result closely resembles the image produced by the initial latents. Conversely, small weights such as 0.001 behave much like no regularization at all, while larger weights generally blur the image details.
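A minimal sketch of this regularizer, reusing `inversion_loss` from the sketch above and parameterizing the latent as `w_init + delta`; the exact averaging over dimensions is our assumption:

```python
def regularized_loss(G, w_init, delta, guide, reg_weight=0.01):
    """Reconstruction loss plus an L2 penalty on delta = w - w_init.

    For w+ latents, delta has shape (batch, num_ws, w_dim); the penalty is
    the squared L2 norm averaged over the batch and ws dimensions.
    """
    recon = inversion_loss(G(w_init + delta), guide)
    reg = delta.pow(2).sum(dim=-1).mean()  # sum over w_dim, mean over the rest
    return recon + reg_weight * reg
```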
Original Image | Vanilla GAN | StyleGAN (w+) |
---|---|---|
![]() | ![]() | ![]() |
Compared to StyleGAN, the Vanilla GAN produces significantly worse reconstructions. This gap likely stems from the generator's limited capacity, and from the fact that the manifold of images generated by the Vanilla GAN lies farther from the real image manifold than StyleGAN's.
Original Image | StyleGAN with z | StyleGAN with w | StyleGAN with w+ |
---|---|---|---|
![]() | ![]() | ![]() | ![]() |
Using StyleGAN's z latent space makes optimization harder, since gradients must be backpropagated through the additional mapping network; the reconstructions end up closer to the image produced by the initial latent than to the target image. Switching to the w or w+ latent space improves results substantially, with w+ giving the best detail and background thanks to its greater expressiveness and per-layer flexibility.

Mean initialization helps the z-space latents, likely because it reduces the expected distance between the initialization and the optimal latent. For the w and w+ spaces, mean and random initializations yield similar results, since the optimization there is comparatively easy.
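For the w and w+ spaces, the mean latent is not available in closed form and is typically estimated by sampling; a minimal sketch, assuming a hypothetical StyleGAN `mapping` network with the usual z-to-w signature:

```python
@torch.no_grad()
def mean_w_init(mapping, num_samples=10_000, z_dim=512, num_ws=18):
    """Estimate the mean w latent by averaging the mapping network's output
    over random z samples, then broadcast it to a w+ initialization."""
    z = torch.randn(num_samples, z_dim)
    w_mean = mapping(z).mean(dim=0, keepdim=True)      # (1, w_dim)
    return w_mean.unsqueeze(1).repeat(1, num_ws, 1)    # (1, num_ws, w_dim)
```

For the z space itself, the mean initialization is simply the zero vector, since z is drawn from a standard Gaussian.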
Based on these results, we use StyleGAN with w+ latents, a perceptual loss weight of 0.01 on the conv_5 layer, and an L1 loss weight of 10, forgoing L2 regularization in the subsequent experiments.
Sketch 1 | Sketch 2 | Sketch 3 |
---|---|---|
![]() | ![]() | ![]() |
![]() | ![]() | ![]() |
The results suggest that denser scribbles with filled-in colors improve the quality of the generated images, particularly the facial details. Conversely, sparse scribbles that trace only the silhouette of the cat tend to yield darker, less detailed facial regions. This may occur because the latents governing the unmasked areas are not directly supervised by the scribbles; they are only influenced indirectly while the masked regions are optimized, so they can overfit to the masked losses and produce less realistic results in the unmasked areas.

Additionally, scribbles drawn in unnatural colors can overfit severely and produce unrealistic results when optimized for many iterations, whereas natural colors that match the real image distribution typically do not exhibit this problem. A likely explanation is that the latents for unnaturally colored images gradually drift away from the real (training) image manifold, at which point the generator can no longer produce realistic images.
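One way to restrict supervision to the scribbled pixels is to mask both images before computing the loss; a sketch reusing `vgg` and `to_vgg_input` from above, where `mask` is 1 on drawn pixels and 0 elsewhere (masking before the feature extractor is one choice among several, and may differ from the original setup):

```python
def masked_loss(generated, scribble, mask, l1_weight=10.0, perc_weight=0.01):
    """Inversion loss computed only where the user has drawn.

    Unmasked regions receive no direct supervision, which is why their
    latents can drift toward unrealistic solutions during optimization.
    """
    gen_m, guide_m = generated * mask, scribble * mask
    l1 = (gen_m - guide_m).abs().sum() / mask.sum().clamp(min=1)
    perc = (vgg(to_vgg_input(gen_m)) - vgg(to_vgg_input(guide_m))).pow(2).mean()
    return l1_weight * l1 + perc_weight * perc
```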
T=500; Strength=10 | T=700; Strength=10 | T=1000; Strength=10 |
---|---|---|
![]() | ![]() | ![]() |
T=700; Strength=5 | T=700; Strength=10 | T=700; Strength=20 |
![]() | ![]() | ![]() |
As observed, increasing the timestep T, and with it the amount of added noise, makes the results follow the text prompt more closely, at the cost of deviating further from the user's sketch. Raising the guidance strength pushes the model toward images with more detail and higher fidelity to the text prompt, but it also risks biasing the output toward features or patterns the model strongly associates with the text.
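For reference, the same procedure can be sketched with Hugging Face diffusers' img2img pipeline, where `strength` plays the role of the starting timestep T (the fraction of the noise schedule applied to the guide image) and `guidance_scale` is the guidance strength; the checkpoint name and prompt are placeholders:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

guide = Image.open("sketch.png").convert("RGB").resize((512, 512))
result = pipe(
    prompt="a photo of a cat",   # placeholder prompt
    image=guide,
    strength=0.7,                # roughly T = 700 of 1000 noise steps
    guidance_scale=10.0,         # classifier-free guidance strength
).images[0]
result.save("output.png")
```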