16-726 Learning-Based Image Synthesis

Spring 2024

Assignment #3 - Cats Generator Playground

Yifei Liu

I. Introduction

For this assignment, we experimented with GANs and implemented two architectures: DCGAN and CycleGAN. For DCGAN, we trained the network to generate grumpy cats from samples of random noise. For CycleGAN, we trained the network to translate between Grumpy and Russian Blue cats, and between apples and oranges.

II. DCGAN

Method

I first implemented DCGAN, which uses a CNN as the discriminator and upsampling + convolutional layers as the generator. The generator progressively upsamples the input noise to generate a fake image, which is then passed to the discriminator for classification. The discriminator and generator have the following architectures:

Discriminator Architecture
Generator Architecture
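To make the architecture concrete, here is a minimal PyTorch sketch of an upsample-then-convolve generator matching the description above; the channel widths and 64x64 output size are assumptions, not the exact values from the figures.

    import torch.nn as nn

    def up_conv(in_ch, out_ch):
        # Upsample by 2x, then convolve (instead of a strided transposed conv)
        return nn.Sequential(
            nn.Upsample(scale_factor=2, mode='nearest'),
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    generator = nn.Sequential(
        nn.ConvTranspose2d(100, 256, kernel_size=4),  # project noise (B,100,1,1) -> (B,256,4,4)
        nn.BatchNorm2d(256),
        nn.ReLU(inplace=True),
        up_conv(256, 128),  # 4x4 -> 8x8
        up_conv(128, 64),   # 8x8 -> 16x16
        up_conv(64, 32),    # 16x16 -> 32x32
        nn.Upsample(scale_factor=2, mode='nearest'),
        nn.Conv2d(32, 3, kernel_size=3, padding=1),   # 32x32 -> 64x64 RGB
        nn.Tanh(),  # outputs in [-1, 1]
    )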

For the discriminator specifically, the padding should be 1 given kernel size k=4 and stride=2, calculated as below:
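Using the standard convolution output-size formula W' = (W - k + 2p)/s + 1 and requiring each discriminator layer to halve the resolution (W' = W/2) with k = 4 and s = 2:

    W/2 = (W - 4 + 2p)/2 + 1  =>  W/2 = W/2 - 2 + p + 1  =>  p = 1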

For the deluxe version of data augmentation, I used the following techniques (a minimal transform sketch follows the list):

  • Enlarge (by 1.1 x image_size)
  • Random crop (to original size)
  • Random horizontal flip

I originally tried random rotation and ColorJitter, which randomly changes the brightness, contrast, saturation, and hue of an image, but neither worked well with DCGAN. ColorJitter produced artifacts in the generated images: the inputs changed too much for the network to learn stable features. Random rotation resulted in cat images that are not upright. I also included the option to apply differentiable augmentation to both real and fake images in the training loops of the discriminator and generator. The training procedure is otherwise the same as the original GAN.
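The snippet below sketches how the differentiable augmentation option fits into a training step. The DiffAugment function and its policy string follow Zhao et al. (2020); the module name diff_augment and the helper are illustrative assumptions, not the actual training code.

    from diff_augment import DiffAugment  # assumed module exposing DiffAugment(x, policy)

    POLICY = 'color,translation,cutout'  # policy string from Zhao et al. (2020)

    def discriminator_inputs(real, fake, use_diffaug):
        # Apply the same differentiable augmentation to both real and fake batches,
        # so gradients can still flow back to the generator through the augmentation.
        if use_diffaug:
            real = DiffAugment(real, policy=POLICY)
            fake = DiffAugment(fake, policy=POLICY)
        return real, fake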

Results

Here are the discriminator and generator training loss curves for the different data-preprocessing settings, all with batch size 32.

--data_preprocess=basic
--data_preprocess=basic --use_diffaug
--data_preprocess=deluxe
--data_preprocess=deluxe --use_diffaug

During successful training, the discriminator loss should initially decrease as it quickly learns to distinguish between real and fake, then oscillate as the generator improves, and ideally stabilize around a value corresponding to a 50-50 guess, meaning it can no longer easily tell real from fake. The generator's loss is expected to be high at the beginning and to decrease as it learns to produce more realistic data.
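For reference, a least-squares GAN objective (a common choice for DCGAN training; whether it matches the exact loss used here is an assumption) produces exactly this behavior:

    # Discriminator: push D(real) toward 1 and D(fake) toward 0
    def d_loss(D, real, fake):
        return 0.5 * ((D(real) - 1) ** 2).mean() + 0.5 * (D(fake.detach()) ** 2).mean()

    # Generator: push D(fake) toward 1, i.e., fool the discriminator
    def g_loss(D, fake):
        return 0.5 * ((D(fake) - 1) ** 2).mean()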

Here are samples of generated images from DCGAN with --data_preprocess=deluxe, early and later in training.

iteration=200
iteration=6800

and with --use_diffaug.

iteration=200
iteration=7000

We can see that the initial generations are blurry and only vaguely capture the color of the cat; gradually the network learns to generate more realistic grumpy cats with more detail and texture. Differentiable augmentation helps the network learn clearer features and the overall structure of the cat.

III. CycleGAN

Method

The generator of the CycleGAN consists of 3 stages: 1) an encoder that extracts image features, 2) a transformation stage built from residual blocks (I used 6 ResNet blocks; a block sketch follows the figure), and 3) a decoder that reconstructs the image. The architecture is as follows:

CycleGAN generator architecture
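As a sketch of the transformation stage, one residual block might look like the following; InstanceNorm and the channel count are assumptions about the implementation.

    import torch.nn as nn

    class ResnetBlock(nn.Module):
        # One block of the 6-block transformation stage: learn a residual on top
        # of the encoded features, so content is preserved by the skip connection.
        def __init__(self, channels=64):
            super().__init__()
            self.conv_block = nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.InstanceNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.InstanceNorm2d(channels),
            )

        def forward(self, x):
            return x + self.conv_block(x)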
The discriminator is similar to DCGAN's discriminator; the only difference is that it classifies patches of the image, which allows the model to focus on local structure. It therefore outputs a 4x4 spatial map of logits instead of a single scalar (1x1).
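A hedged sketch of such a patch discriminator for 64x64 inputs is below; the channel widths are assumptions, but the stride-2, k=4, p=1 convolutions show how the 4x4 output grid arises.

    import torch.nn as nn

    patch_discriminator = nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1),    # 64x64 -> 32x32
        nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),  # 32x32 -> 16x16
        nn.InstanceNorm2d(128),
        nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1), # 16x16 -> 8x8
        nn.InstanceNorm2d(256),
        nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(256, 1, kernel_size=4, stride=2, padding=1),   # 8x8 -> 4x4: one logit per patch
    )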

The most distinctive part of CycleGAN is the cycle-consistency loss, which encourages a generated image to be translatable back to the original. The loss is the mean squared error between the original image and the reconstruction obtained by passing it through both generators in sequence (e.g., X -> Y -> X).
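A minimal sketch of the X -> Y -> X term (the generator names and the loss weight lambda_cycle are assumptions):

    import torch.nn.functional as F

    def cycle_consistency_loss(real_X, G_XtoY, G_YtoX, lambda_cycle=10.0):
        # Translate to the other domain and back, then compare to the original
        reconstructed_X = G_YtoX(G_XtoY(real_X))
        return lambda_cycle * F.mse_loss(reconstructed_X, real_X)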

Results

Here are result samples with and without the cycle-consistency loss.

--disc patch
Russian Blue to Grumpy, iteration=1000
Grumpy to Russian Blue, iteration=1000
Russian Blue to Grumpy, iteration=10000
Grumpy to Russian Blue, iteration=10000
--disc patch --use_cycle_consistency_loss
Russian Blue to Grumpy, iteration=1000
Grumpy to Russian Blue, iteration=1000
Russian Blue to Grumpy, iteration=10000
Grumpy to Russian Blue, iteration=10000
--disc patch --X apple2orange/apple --Y apple2orange/orange
Apple to orange, iteration=9800
Orange to apple, iteration=9800
--disc patch --use_cycle_consistency_loss --X apple2orange/apple --Y apple2orange/orange
Apple to orange, iteration=10000
Orange to apple, iteration=10000

We can see that by including the cycle-consistency loss, the model generates more realistic and higher-quality images. It captures the texture, color, and overall structure of the cat or fruit, and the generated images have fewer artifacts than those produced without the cycle-consistency loss. The round-trip translation (X -> Y -> X) helps the model learn to preserve the content of the original image and change only domain-specific features.

Let's also compare these results with those obtained using the DCDiscriminator instead of the PatchDiscriminator, with all samples generated at iteration=10000.

--disc dc --use_cycle_consistency_loss
Russian Blue to Grumpy
Grumpy to Russian Blue
Apple to orange
Orange to apple

We can see that in most cases the PatchDiscriminator results are better than these DCDiscriminator results, which show more artifacts and blurrier images, especially for the Grumpy to Russian Blue translation. The model gets the locations of the eyes and nose right, but not the texture. Because the PatchDiscriminator judges realness patch by patch, the model can focus on local structure and attend to finer details, producing more realistic results at a smaller scale. That is why the results above, generated without the patch discriminator, capture the overall structure of the cat or fruit but lack good local detail.

IV. Bells & Whistles

1. Diffusion Model

I implemented and trained a diffusion model on the grumpy cat dataset. All the missing parts are completed in diffusion_model.py, diffusion_utils.py, train_ddpm.py, and test_ddpm.py. The model is trained for 2000 epochs with batch size 8. The results are not realistic yet, but the model captures the general structure of the cat. Generating more realistic images would require further debugging of the model architecture and training procedure; given the time constraints, I show some preliminary results below.
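For context, the training objective follows the usual DDPM noise-prediction recipe; this is a generic sketch rather than the exact code in train_ddpm.py, and the model(x_t, t) signature is an assumption.

    import torch
    import torch.nn.functional as F

    def ddpm_loss(model, x0, alphas_cumprod):
        # Sample a random timestep per image, noise the clean image to that
        # timestep, and regress the model's prediction of the added noise.
        t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],), device=x0.device)
        noise = torch.randn_like(x0)
        a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
        x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
        return F.mse_loss(model(x_t, t), noise)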

Diffusion output grumpy cat 1
Diffusion output grumpy cat 2

2. Samples from pre-trained diffusion model

I tried generating cat samples using SDXL Turbo, which is based on Stable Diffusion. The results are as follows. Noticeably, the generated images are not realistic and share a similar artistic style, which relates to the concern discussed in class about the aesthetic scoring used to train these generative models, which can make a model prefer one particular style over others.
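The samples were generated along the lines of the following diffusers sketch; the exact prompts are examples, not necessarily the ones used.

    import torch
    from diffusers import AutoPipelineForText2Image

    pipe = AutoPipelineForText2Image.from_pretrained(
        "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
    ).to("cuda")

    image = pipe(
        prompt="a realistic photo of a grumpy cat",
        num_inference_steps=1,  # SDXL Turbo is distilled for few-step sampling
        guidance_scale=0.0,     # turbo models run without classifier-free guidance
    ).images[0]
    image.save("grumpy_cat.png")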

Grumpy cat, realistic
Russian blue cat, realistic