Towards Few-Shot Image Synthesis

Zhipeng Bao (zbao), Canbo Ye (canboy)

Overview

Motivated by human cognition, i.e., the ability of humans to generalize from a few visual concepts to perceive the world, we aim to explore an interesting and challenging topic in computer vision: few-shot image synthesis. That is, after seeing only one or a few images of certain novel categories, the model is trained to generate new and diverse images of those categories. We believe that the generalization ability of machine vision can be greatly improved when a model can learn from only a few images.

For this target problem, we adopt the prototypical network as the discriminative model and propose a general architecture for image generation networks (Generative Adversarial Networks, GANs). Equipped with our architecture, a generative model can handle few-shot image generation tasks. We instantiate our architecture with three state-of-the-art models for three different visual tasks:
a. SAGAN for general image generation; [1]
b. HoloGAN for multi-view image synthesis; [2]
c. StarGAN for multi-domain image translation. [3]
The experimental results indicate that the proposed architecture and algorithm can improve the diversity of the generated images and the robustness of the model in the few-shot regime.


Introduction

Human beings can recognize a new object from very few instances and hallucinate that new class with a different pose or appearance even though we have never actually seen it before. For example, given a new class Gadwall with only a few instances (Fig. 1), humans are able to recognize this kind of bird and also imagine what it would look like from different viewpoints. Achieving a similar level of generalization for machines is a crucial problem in computer vision, known as few-shot learning.

Figure 1: An example of how human-beings recognize a new class. [4]

The goal of few-shot learning is to classify new data having seen only a few training examples. Fig. 2 shows a typical framework for few-shot learning algorithms: we first train a base model with abundant examples of the base classes, and then fine-tune the base model with only a few examples of the novel classes. The assumption is that a deep model trained with abundant data has good generalization and robustness, so it can be easily transferred to novel tasks with only a handful of examples. In our project, we aim to bring few-shot methods to the area of image synthesis.

Figure 2: A typical few-shot framework. [5]

Specifically, the few-shot method we mainly rely on in our project is the prototypical network, shown in Fig. 3. The idea is to average the embeddings of the examples from each class to obtain a per-class mean embedding, called a prototype. The similarity between each prototype and the query embedding is then used as the basis for classification. This method is robust to data imbalance by construction, so it can handle the few-shot setting.
Figure 3: Prototypical network in the few-shot scenarios. [6]
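To make this concrete, the snippet below is a minimal PyTorch sketch of prototypical classification (our own illustration of [6], not code from this project); the embedding network `embed`, the tensor shapes, and the function name are assumptions.

```python
import torch
import torch.nn.functional as F

def prototypical_logits(embed, support_x, support_y, query_x, n_classes):
    """Classify query images by their distance to per-class mean embeddings."""
    support_feat = embed(support_x)        # (N_support, D)
    query_feat = embed(query_x)            # (N_query, D)

    # Prototype = mean embedding of the support examples of each class.
    prototypes = torch.stack([
        support_feat[support_y == c].mean(dim=0) for c in range(n_classes)
    ])                                     # (n_classes, D)

    # Negative squared Euclidean distance serves as the classification logit.
    return -torch.cdist(query_feat, prototypes) ** 2

# Training minimizes the cross-entropy of these logits against the query labels:
# loss = F.cross_entropy(prototypical_logits(embed, xs, ys, xq, n_classes), yq)
```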

On the other hand, recent years have witnessed great success in image synthesis research. Fig. 4 shows some images generated by a state-of-the-art GAN model. These results are so realistic that we can hardly distinguish them from real ones. However, most of these techniques require thousands of examples to achieve such promising performance. Therefore, we aim to explore the possible combination of few-shot learning and image synthesis.

Figure 4: Some generated images by StyleGAN. [7]

In the following sections, we elaborate on our proposed methods and present experiments to evaluate our approach.


Proposed Methods

Problem Setting: Few-shot image synthesis

We define the problem of few-shot image synthesis as shown in Fig. 5:

Figure 5: Problem setting of few-shot image synthesis.
Suppose we have a sufficient number of images in many categories during training, and we train a generator with these images. At test time, given only K images of each unseen category, the generator should be adapted to the new categories within several epochs of fine-tuning, as sketched below.
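The sketch below paraphrases this two-stage protocol (it is not our actual training script); `train_step`, the data loaders, and the epoch counts are placeholders.

```python
def few_shot_synthesis(train_step, base_loader, novel_loader,
                       pretrain_epochs=200, finetune_epochs=5):
    """Pre-train on abundant base classes, then adapt to K-shot novel classes."""
    # Stage 1: standard conditional adversarial training on the base categories.
    for _ in range(pretrain_epochs):
        for images, labels in base_loader:
            train_step(images, labels)

    # Stage 2: fine-tune for only a few epochs on K images per unseen category.
    for _ in range(finetune_epochs):
        for images, labels in novel_loader:
            train_step(images, labels)
```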

Prototypical discriminator

Inspired by the success of few-shot classification, our main idea for this task is: let the discriminator guide the generator (Fig. 6).
We hold that the generator learns to synthesize by utilizing the signals sent by the discriminator, so if the discriminator can work in the few-shot setting, it can guide the generator to work for the corresponding few-shot image generation task.

Figure 6: The key idea of our proposed method: let the discriminator guide the image generation.

Therefore, we propose a few-shot prototypical discriminator for the target few-shot image generation task.

Figure 7: Different kinds of discriminator design in conditional image generation and our proposed method.
From our literature review, we found two main kinds of discriminator designs in conditional image generation.
(1) As shown in Fig. 7-(a), for an image x and a category label y, the discriminator embeds the image and the category label individually:
e1 = Embed1(x); e2 = Embed2(y)

It then discriminates real from fake images using the concatenation of the two embeddings:
y_{real/fake} = D([e1, e2])

SAGAN [1] and the vanilla cGAN [8] are two typical networks that use this architecture. The advantage of this architecture is that it is easy to implement and robust across different datasets. However, the disadvantage is that it does not disentangle the category information from the image (real or fake) information, so the generated images are less diverse.
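A hedged PyTorch sketch of design (a) is given below; the layer sizes, `img_dim`, and class names are illustrative assumptions rather than the exact SAGAN/cGAN implementation.

```python
import torch
import torch.nn as nn

class ConcatDiscriminator(nn.Module):
    """Design (a): embed image and label separately, then score the concatenation."""
    def __init__(self, img_dim, n_classes, feat_dim=128):
        super().__init__()
        self.embed_img = nn.Sequential(nn.Linear(img_dim, feat_dim), nn.LeakyReLU(0.2))
        self.embed_lbl = nn.Embedding(n_classes, feat_dim)
        self.head = nn.Linear(2 * feat_dim, 1)          # real/fake score

    def forward(self, x, y):
        e1 = self.embed_img(x.flatten(1))               # image embedding
        e2 = self.embed_lbl(y)                          # label embedding
        return self.head(torch.cat([e1, e2], dim=1))    # y_{real/fake}
```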

(2) In comparison, in the other architecture the discriminator first predicts a shared embedding of the image and then predicts both the real/fake score and the category label from this embedding:

e = Embed(x)

y_{real/fake} = D1(e); y_{cat} = D2(e)

This design is mostly used in more complicated architectures to balance the trade-off between several loss functions and different features.
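For comparison, here is a sketch of design (b) in the same style; again the layer sizes and names are our assumptions.

```python
import torch.nn as nn

class TwoHeadDiscriminator(nn.Module):
    """Design (b): one shared embedding feeds a real/fake head and a category head."""
    def __init__(self, img_dim, n_classes, feat_dim=128):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(img_dim, feat_dim), nn.LeakyReLU(0.2))
        self.adv_head = nn.Linear(feat_dim, 1)           # y_{real/fake}
        self.cls_head = nn.Linear(feat_dim, n_classes)   # y_{cat}

    def forward(self, x):
        e = self.embed(x.flatten(1))                     # shared embedding
        return self.adv_head(e), self.cls_head(e)
```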

We build our architecture based on the second design. We replace the category discriminator with a prototypical classifier:

y_{cat} = Proto(e)

Specifically, we discriminate features in the latent space so that the model can better utilize the pre-trained feature representations. In the few-shot setting with K > 1, for training on real images, we follow the standard prototypical training:
f_{support} = Proto(Embed(x_{support}))

loss = MinMax(f_{support}, Proto(Embed(x_{query})))

When K = 1, we have no access to a support set, so we perform the min-max with only the query images:
loss = MinMax(Proto(Embed(x_{query})))

For the training of the generator, we use all of the real training images as the support set and perform the min-max against the fake images:
loss = MinMax(Proto(Embed(x_{real})), Proto(Embed(x_{fake})))
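The sketch below illustrates how these prototype-based losses could be computed in PyTorch. It is a simplified reading of the equations above, not our exact implementation: cross-entropy over negative squared distances stands in for MinMax, and all function and variable names are assumptions.

```python
import torch
import torch.nn.functional as F

def compute_prototypes(feats, labels, n_classes):
    """Per-class mean embeddings, i.e. Proto(Embed(x))."""
    return torch.stack([feats[labels == c].mean(dim=0) for c in range(n_classes)])

def real_proto_loss(embed, support_x, support_y, query_x, query_y, n_classes):
    """Discriminator-side category loss on real images (K > 1 case)."""
    # When K = 1 there is no separate support set, so the query images
    # themselves would be used to build the prototypes.
    protos = compute_prototypes(embed(support_x), support_y, n_classes)
    logits = -torch.cdist(embed(query_x), protos) ** 2
    return F.cross_entropy(logits, query_y)

def generator_proto_loss(embed, real_x, real_y, fake_x, fake_y, n_classes):
    """Generator-side loss: real images form the prototypes, fakes are the queries."""
    protos = compute_prototypes(embed(real_x), real_y, n_classes)
    logits = -torch.cdist(embed(fake_x), protos) ** 2
    return F.cross_entropy(logits, fake_y)
```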


Evaluation

As the prototypical discriminator is a general architecture that can be applied to any conditional GAN model, we verify our design with several target tasks and networks. We select three different tasks: general image generation, multi-view synthesis, and multi-domain translation.

For the evaluation metric, based on Jun-Yan's comments, since we mainly experiment with fine-grained datasets, the traditional metrics for image generation, such as Inception Score and FID, cannot faithfully reflect the real performance of the models. Besides, in the few-shot setting, since the quality of the generated images is limited, the differences in these scores are too small to be informative. Therefore, we only compare the qualitative results (the generated images). We think this is also in the spirit of this course: the generated images are what we actually care about. So for the following results, we only show visual comparisons. We also provide FID and Inception Score evaluation code in our code submission, so these scores can always be computed for comparison.
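For reference, FID and Inception Score can be computed with off-the-shelf tools such as torchmetrics; the snippet below is only an illustration (not our submitted evaluation script) and assumes uint8 image tensors of shape (N, 3, H, W) and the torchmetrics image extras being installed.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore

# Placeholder batches; in practice these would be real and generated images.
real_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)
fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print("FID:", fid.compute().item())

inception = InceptionScore()
inception.update(fake_images)
is_mean, is_std = inception.compute()
print("IS:", is_mean.item(), "+/-", is_std.item())
```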

Few-shot Image Generation

Dataset: CUB dataset [9]. We randomly split 150 classes as the base classes for pre-training and the other 50 classes as the novel classes for few-shot fine-tuning. We set K = 5. Fig. 8 shows some example images from the CUB dataset.
Compared Models: SAGAN [1] and SAGAN-proto (Ours)

Figure 8: Example images from the CUB dataset.

Our results and comparisons are shown in Fig. 9.

Figure 9: Left: images generated by SAGAN in the base setting; Right: images generated by SAGAN-Proto in the base setting.
We can see that both the baseline and our model work well in the base setting, indicating that the prototypical discriminator does not hurt the normal training process. The comparison in the novel few-shot setting is shown in Fig. 10.
Figure 10: Left: images generated by SAGAN in the novel setting; Right: images generated by SAGAN-Proto in the novel setting.

We can see that, in this setting, the visual quality of our model is much better than the baseline. We think the reason is that when we start fine-tuning, the baseline keeps the "good" feature representation obtained from pre-training, but its classification head MUST be re-trained from scratch because of the mismatch in the number of classes (150 vs. 50). This newly initialized classification head (not the embedding/feature network) overfits to the few examples, and the problem propagates back to the generator, so the overall visual quality degrades.
In comparison, our model performs classification directly in the latent space, so we never re-train a classification head but only fine-tune the feature representations, which is more robust to the novel classes.
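The toy snippet below illustrates this difference (our own illustration, not project code): a fixed-size classification head has to be re-initialized when the label space changes, while the prototype classifier has no class-specific weights to re-train.

```python
import torch.nn as nn

feat_dim = 128
base_head = nn.Linear(feat_dim, 150)    # trained on the 150 base classes
novel_head = nn.Linear(feat_dim, 50)    # must be newly initialized for 50 novel classes

# The prototypical classifier instead reuses the pre-trained embedding network:
# novel-class "weights" are simply the mean embeddings of the K support images,
# so nothing class-specific needs to be re-trained from scratch.
```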

Few-shot Multi-View Image Synthesis

Dataset: CelebA-HQ dataset [10]. We randomly select 35 attributes as training attributes and 5 (Black Hair, Gray Hair, Bald, Wearing Hat, and Aging) as few-shot test attributes. K = 5.
Compared Models: HoloGAN [2] and HoloGAN-proto (Ours)

The results in the base setting and the few-shot setting are shown in Fig. 11 and Fig. 12, respectively.

Figure 11: Images generated by HoloGAN and HoloGAN-proto in the base setting.
Figure 12: Images generated by HoloGAN and HoloGAN-proto in the novel setting.
We can see that, in the few-shot case, the baseline tends to memorize some repeated patterns and almost collapses in terms of 3D identity. We think this is also caused by overfitting. However, in this case, since CelebA-HQ is an attribute dataset, the normal classifier can also utilize the shared feature representation and does not need to re-train the classifier from scratch, so it can still synthesize meaningful images but overfits to some repeated patterns (the same problem as in the third experiment). In comparison, our model can capture the shape, the attributes, and the 3D identity even in an extremely low-data regime.

Few-shot Multi-Domain Image Translation

Dataset: CelebA dataset [11]. We select a set of hair color attributes (black hair, blond hair, brown hair, bald, and gray hair), which can be regarded as mutually exclusive image domains with similar semantic definitions. We use the first four domains as the training domains that have abundant images and the last one as the few-shot domain that has only K images.
Compared Models: StarGAN [12], CycleGAN [13], and StarGAN-proto (Ours)

For the task of multi-domain image translation, our proposed method pairs the StarGAN generator with the prototypical discriminator. We use CycleGAN and StarGAN as baselines for comparison. The goal of the experiments is to translate multiple hair colors to gray hair. We use gray hair as the few-shot domain for fine-tuning and the other domains for base-model training. Fig. 13 illustrates the experimental results on the CelebA dataset with different K values.

Figure 13: Experiment results on the CelebA dataset with different K values.

The first row shows the raw input images and the other rows show results with only K images in the novel-class dataset. As we can see from the second column, CycleGAN achieves a good translation of hair color, but it overfits and modifies many image details besides the hair color, and larger changes are observed as K becomes smaller. In general, CycleGAN tends to make the whole image brighter instead of targeting the intended attribute in few-shot settings.
As for StarGAN, it overcomes CycleGAN's shortcoming by retaining details from the input images. We can see that StarGAN does learn to make the hair whiter and brighter to some extent while keeping other features almost unaltered. However, it shows a tendency to overfit as K gets smaller, which indicates a lack of robustness in the few-shot setting.
Our approach, on the one hand, performs better in terms of robustness in the few-shot setting: it maintains many details of the input even with very limited training images. On the other hand, we admit that its domain-translation performance is not very compelling. We can see that the hair color does become closer to gray, but the translation effect is not very conspicuous, which calls for further exploration.


Conclusion

In this project, we propose a general prototype-based architecture for few-shot image synthesis and verify this architecture with three different models targeting three different tasks. The key idea is that our prototypical discriminator can better utilize the pre-trained features, so the discriminator can give useful feedback to guide the generator in the few-shot setting. The experimental results show that our architecture is robust and can boost the diversity of the generated images in the low-data regime.


References

[1] Zhang, Han, et al. "Self-attention generative adversarial networks." ICML. 2019.
[2] Nguyen-Phuoc, Thu, et al. "Hologan: Unsupervised learning of 3d representations from natural images." ICCV. 2019.
[3] Choi, Yunjey, et al. "Stargan: Unified generative adversarial networks for multi-domain image-to-image translation." CVPR. 2018.
[4] Bao, Zhipeng, Yu-Xiong Wang, and Martial Hebert. "Bowtie Networks: Generative Modeling for Joint Few-Shot Recognition and Novel-View Synthesis." ICLR. 2021.
[5] Hariharan, Bharath, and Ross Girshick. "Low-shot visual recognition by shrinking and hallucinating features." ICCV. 2017.
[6] Snell, Jake, Kevin Swersky, and Richard S. Zemel. "Prototypical networks for few-shot learning." NeurIPS. 2017.
[7] Karras, Tero, Samuli Laine, and Timo Aila. "A style-based generator architecture for generative adversarial networks." CVPR. 2019.
[8] Mirza, Mehdi, and Simon Osindero. "Conditional generative adversarial nets." arXiv preprint arXiv:1411.1784 (2014).
[9] Welinder, Peter, et al. "Caltech-UCSD birds 200." (2010).
[10] Lee, Cheng-Han, et al. "Maskgan: Towards diverse and interactive facial image manipulation." CVPR. 2020.
[11] Liu, Ziwei, et al. "Large-scale celebfaces attributes (celeba) dataset." Retrieved August 15.2018 (2018): 11.
[12] Choi, Yunjey, et al. "Stargan: Unified generative adversarial networks for multi-domain image-to-image translation." CVPR. 2018.
[13] Zhu, Jun-Yan, et al. "Unpaired image-to-image translation using cycle-consistent adversarial networks." ICCV. 2017.