1. Overview

1.1 Introduction

In this project, I delved into the realm of Generative Adversarial Networks (GANs), focusing on their application in image generation and transformation. The project is structured into two main segments, each aimed at deploying GANs for distinct image synthesis tasks using a hands-on coding approach.

Part One delves into the implementation of a Deep Convolutional GAN (DCGAN), a variant of GANs that leverages convolutional neural networks for the generator and discriminator components. This part of the project is dedicated to generating images of grumpy cats from random noise inputs, showcasing the ability of DCGANs to synthesize high-quality images.

Part Two advances to a more complex architecture known as CycleGAN, renowned for its effectiveness in image-to-image translation tasks without the need for paired training data. The primary objective here is to train the CycleGAN model to perform transformations between two distinct types of cat images (Grumpy and Russian Blue) as well as between images of apples and oranges. This section highlights the versatility of GANs in manipulating and transforming images across different domains.

The bells-and-whistles portion also made room for exploring a diffusion model, a VAE, and some improvements over the baseline DCGAN and CycleGAN.

2. Implementation

2.1 Deep Convolutional GAN

A DCGAN is simply a GAN that uses a convolutional neural network as the discriminator, and a network composed of transposed convolutions as the generator. To implement the DCGAN, we need to specify three things: (1) the generator, (2) the discriminator, and (3) the training procedure.

To achieve a downsampling factor of 2 in our convolutional layers with a kernel size of K = 4 and a stride of S = 2, we need to calculate the appropriate padding. The formula for determining the output size of a convolutional operation is given by:

Output size = (W - K + 2P) / S + 1

Where W is the input size, K is the kernel size, P is the padding, and S is the stride.

Given that we want to halve the input size (here W = 64), the output size should be W/2. Substituting the known values into the formula gives us:

W/2 = (W - 4 + 2P) / 2 + 1

Multiplying both sides by 2 and simplifying gives W = W - 4 + 2P + 2, i.e., 2P = 2. The padding required to achieve the desired downsampling is therefore:

P = 1
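
As a quick sanity check, a minimal PyTorch snippet (illustrative, not the project's actual layer definitions) confirms that K = 4, S = 2, P = 1 halves a 64x64 input:

```python
import torch
import torch.nn as nn

# kernel_size=4, stride=2, padding=1 should halve the spatial resolution:
# (64 - 4 + 2*1) / 2 + 1 = 32
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=4, stride=2, padding=1)
x = torch.randn(1, 3, 64, 64)
print(conv(x).shape)  # torch.Size([1, 64, 32, 32])
```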

Screenshots of discriminator and generator training loss:

[Figure: discriminator and generator training loss curves]

Expected Outcomes for Each Condition if the GAN Learns:

    Basic vs. Deluxe Data Preprocessing: Deluxe preprocessing introduces more variation into the training data (e.g., random crops, flips, rotations), which can help the GAN generalize better and become more robust. Thus, for deluxe preprocessing, I would expect a slightly more stable training process with possibly higher initial losses (due to increased task difficulty) but better overall performance in generating diverse and convincing images.
    Without vs. With Differentiable Augmentation: Differentiable augmentation applies augmentations that gradients can flow through to both real and generated images before they reach the discriminator, which adds variability during training and improves robustness and generalization. Training with differentiable augmentation might show higher initial losses for both the discriminator and the generator, since each faces a harder task. However, as training progresses, this approach can lead to more stable and convergent behavior, as it helps prevent discriminator overfitting and encourages the generator to produce more realistic images. A minimal sketch of both preprocessing ideas follows this list.
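
The sketch below illustrates both ideas under assumed settings: a deluxe-style torchvision transform pipeline and a tiny differentiable-augmentation function applied to both real and fake batches before the discriminator. The function and parameter names here are mine, not the course starter code.

```python
import torch
from torchvision import transforms

# Deluxe-style preprocessing: extra variation baked into the data loader.
deluxe_transform = transforms.Compose([
    transforms.Resize(int(64 * 1.1)),
    transforms.RandomCrop(64),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

def diff_augment(x, brightness=0.2):
    """Differentiable augmentation: pure tensor ops, so gradients flow through.
    Applied to BOTH real and generated batches right before the discriminator."""
    # Per-sample random brightness jitter
    jitter = (torch.rand(x.size(0), 1, 1, 1, device=x.device) - 0.5) * 2 * brightness
    x = x + jitter
    # Per-sample random horizontal flip
    flip = torch.rand(x.size(0), 1, 1, 1, device=x.device) < 0.5
    x = torch.where(flip, x.flip(-1), x)
    return x

# In the training loop (hypothetical names):
# d_real = D(diff_augment(real_images))
# d_fake = D(diff_augment(G(noise)))
```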

Actual result samples:

[Figure: generated samples under each training condition]

2.2 CycleGAN

CycleGAN, short for Cycle-Consistent Generative Adversarial Networks, is a method for training unsupervised image translation models via the GAN architecture. The key idea behind CycleGAN is to learn mappings between two domains (let's call them X and Y) without paired examples. It does this by using two generators and two discriminators:

    Generator G: Transforms images from domain X to domain Y.
    Generator F: Transforms images from domain Y to domain X.
    Discriminator Dx: Differentiates between real images from domain X and images translated into X by F.
    Discriminator Dy: Differentiates between real images from domain Y and images translated into Y by G.
The objective of CycleGAN includes an adversarial loss (to match the distribution of generated images to the data distribution in the target domain) and a cycle consistency loss (to prevent the learned mappings from contradicting each other).

Cycle consistency loss is a critical component in the training of CycleGANs. It enforces that an image from one domain (X), when transformed into another domain (Y), and then transformed back to the original domain (X), should look similar to the original image. This helps in preserving the key attributes of the input data and provides a way to enforce learning without paired examples.

    When cycle consistency loss is included in the training of CycleGANs, it often leads to improved preservation of input image content. The cycle loss acts as a regularizer that discourages the network from making changes to the input image that cannot be reversed. As a result, transformations that maintain the structural integrity and key characteristics of the image are encouraged, leading to more realistic and coherent translated images.
    Without cycle consistency loss, the translations can become incoherent or lose important structures and relationships in the image. The network might still learn to transform images from one domain to another but without the constraint that ensures the transformations can be reversed. This often leads to results that may capture the style of the target domain but fail to preserve the content from the source domain. There might be distortions or changes in the image that don’t make sense when attempting to reverse the transformation.
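
To make the objective above concrete, here is a minimal sketch of the generator-side CycleGAN loss, assuming generators G: X->Y and F: Y->X, discriminators D_X and D_Y, and the common least-squares adversarial formulation; the weight lambda_cycle = 10 is the value from the original paper, not necessarily what this project used.

```python
import torch
import torch.nn.functional as nnf

def cyclegan_generator_loss(G, F, D_X, D_Y, x, y, lambda_cycle=10.0):
    """Generator-side objective: adversarial terms plus cycle consistency.
    G maps X -> Y, F maps Y -> X; D_X and D_Y score images in their domains."""
    fake_y = G(x)  # translation X -> Y
    fake_x = F(y)  # translation Y -> X

    # Least-squares adversarial terms: translated images should be scored as real
    pred_fake_y = D_Y(fake_y)
    pred_fake_x = D_X(fake_x)
    adv = nnf.mse_loss(pred_fake_y, torch.ones_like(pred_fake_y)) \
        + nnf.mse_loss(pred_fake_x, torch.ones_like(pred_fake_x))

    # Cycle consistency: X -> Y -> X and Y -> X -> Y should recover the inputs
    cycle = nnf.l1_loss(F(fake_y), x) + nnf.l1_loss(G(fake_x), y)

    return adv + lambda_cycle * cycle
```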

CycleGAN results using the patch discriminator, with and without cycle consistency loss:

[Figure: CycleGAN translations with the patch discriminator, with and without cycle consistency loss]

From the results, we can indeed see that with cycle consistency loss the outputs are slightly better in both the catA-catB and apple-orange cases. The reason lies in the goal of the cycle consistency loss: to ensure that the network not only learns to map an image from one domain to the other, but also finds a mapping that is somewhat reversible. This is a powerful constraint because it requires the network to understand and maintain the underlying structure of the input images. Without this loss, the network has no incentive to preserve these structures and can produce images that look like they belong to the target domain but don't necessarily have a meaningful correspondence to the input images.

CycleGAN results using the DC discriminator with cycle consistency loss:

[Figure: CycleGAN translations with the DC discriminator and cycle consistency loss]

DCDiscriminator (Deep Convolutional Discriminator) is the standard discriminator that classifies entire images as real or fake. It processes the whole image through several layers of convolution, downsampling the image into a single scalar output that represents the probability of the image being real.

PatchDiscriminator operates on patches of the input image. Instead of downsampling the image to a single scalar, it outputs a matrix where each element corresponds to a patch of the input image. Each element provides a probability indicating whether that particular patch is real or fake. This means the PatchDiscriminator makes its decision based on local image features rather than the global structure.

    Detail and Texture: The PatchDiscriminator tends to focus on finer details and textures within local regions of the image. Therefore, the generated images overall have more realistic and detailed textures compared to those produced with a DCDiscriminator.
    Realism: Images generated with a PatchDiscriminator appear more locally realistic, since the discriminator has been trained to focus on small patches. It is especially useful for tasks where texture and fine detail are crucial, such as the cat images here.
    Artifacts: With a PatchDiscriminator, there is a potential for patchy artifacts if the generator overfits to fooling the discriminator at the patch level but fails to maintain coherence across patches, like the black splits on one of the cats. A small sketch contrasting the two discriminators' output shapes follows this list.
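
To illustrate the output-shape difference, here are two toy discriminators (illustrative architectures, not the course's exact models): the DC-style network collapses a 64x64 image to a single score, while the patch-style network produces a grid of per-patch scores.

```python
import torch
import torch.nn as nn

dc_discriminator = nn.Sequential(          # 64x64 -> single score per image
    nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
    nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),
    nn.Conv2d(128, 256, 4, 2, 1), nn.ReLU(),
    nn.Conv2d(256, 512, 4, 2, 1), nn.ReLU(),
    nn.Conv2d(512, 1, 4, 1, 0),            # 4x4 feature map -> 1x1 output
)

patch_discriminator = nn.Sequential(       # 64x64 -> 8x8 grid of patch scores
    nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
    nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),
    nn.Conv2d(128, 256, 4, 2, 1), nn.ReLU(),
    nn.Conv2d(256, 1, 3, 1, 1),            # one score per spatial location
)

x = torch.randn(2, 3, 64, 64)
print(dc_discriminator(x).shape)     # torch.Size([2, 1, 1, 1])  -> global real/fake
print(patch_discriminator(x).shape)  # torch.Size([2, 1, 8, 8])  -> per-patch real/fake
```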

2.3 Spectral Normalization (Bells and Whistles)

Spectral normalization is a technique designed to stabilize the training of discriminators in GANs by normalizing the spectral norm of the weight matrices. The process involves wrapping each convolutional layer with torch.nn.utils.spectral_norm in discriminator model definitions (in conv function).
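
A minimal sketch of what this looks like, assuming a conv helper along the lines of the one in the starter code (the argument names here are mine):

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def conv(in_channels, out_channels, kernel_size=4, stride=2, padding=1,
         norm=None, spectral=False):
    """Hypothetical conv helper: optionally wraps the convolution with
    spectral normalization before adding any other normalization layer."""
    layers = []
    c = nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding,
                  bias=norm is None)
    if spectral:
        c = spectral_norm(c)  # constrains the layer's spectral norm to 1
    layers.append(c)
    if norm == "instance":
        layers.append(nn.InstanceNorm2d(out_channels))
    elif norm == "batch":
        layers.append(nn.BatchNorm2d(out_channels))
    return nn.Sequential(*layers)
```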

[Figure: training results with spectral normalization]

The results show that spectral normalization significantly stabilizes training.

2.4 VAE for Cat Generation (Bells and Whistles)

VAE Model Overview:
The Variational Autoencoder (VAE) model defined here consists of two main components: an encoder and a decoder. The encoder processes input images to produce a latent space representation characterized by mean and variance vectors. The decoder reconstructs images from samples drawn from this latent space, using the reparameterization trick to enable gradient-based optimization. VAEs are known for their stable training, theoretical foundation in probabilistic graphical models, and ability to generate diverse samples.
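
For reference, a minimal sketch of such a VAE with the reparameterization trick; the layer sizes and the 128-dimensional latent space are illustrative assumptions, not the exact configuration used in this project.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Minimal encoder/decoder VAE for 64x64 RGB images (illustrative sizes)."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),  # 32 -> 16
            nn.Flatten(),
        )
        self.fc_mu = nn.Linear(64 * 16 * 16, latent_dim)
        self.fc_logvar = nn.Linear(64 * 16 * 16, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),  # 16 -> 32
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),   # 32 -> 64
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, so gradients can
        # flow through mu and logvar despite the sampling step.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

# Training loss (sketch): reconstruction term + KL divergence to the unit Gaussian prior
# recon = F.mse_loss(x_hat, x); kl = -0.5 * (1 + logvar - mu**2 - logvar.exp()).sum()
```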

VAE results after 12500 epochs:

[Figure: VAE-generated cat samples]

Results Description:
As can be seen from the results, the VAE model produces quite pale images that lack sharp detail compared to the original data. I am not sure whether this is due to my implementation, but people do say this effect can arise from the VAE's objective: the reconstruction loss encourages averaged representations, leading to a loss of detail and vividness in the generated images, and the regularization term in the loss function, which aims for a well-structured latent space, further contributes to smoother outputs.

Comparison Between DCGAN and VAE:
DCGANs and VAEs are both popular models for generating images, but they have distinct characteristics and are suitable for different applications. DCGANs are known for generating high-quality and sharp images, thanks to their adversarial training mechanism that encourages the model to produce realistic samples. However, they can suffer from training instability and mode collapse, making them challenging to train. On the other hand, VAEs offer more stable training and a well-understood theoretical foundation but tend to generate smoother and less detailed images due to their reconstruction-based objective.

2.5 Use Pre-trained Diffusion Model (Bells and Whistles)

Used the Hugging Face Diffusers repo: https://huggingface.co/docs/diffusers/en/using-diffusers/img2img. Played with image-to-image, text-to-image, text-to-image-to-image, etc.
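
For reference, the basic image-to-image recipe from that guide looks roughly like the sketch below; the checkpoint name, input file, and prompt are placeholders, not necessarily what I actually used.

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

# Load a pre-trained Stable Diffusion img2img pipeline (placeholder checkpoint).
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("grumpy_cat.png").convert("RGB").resize((512, 512))
result = pipe(
    prompt="a photo of a grumpy cat, high detail",
    image=init_image,
    strength=0.6,        # how far to deviate from the init image (0 = keep, 1 = ignore)
    guidance_scale=7.5,  # classifier-free guidance weight
).images[0]
result.save("grumpy_cat_img2img.png")
```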

I then tried to fine-tune it, but it seems I could not override what the model had already learned; the results just look like generic cats.

[Figure: diffusion model outputs]