
GANs to Understand How the Human Brain Makes Sense of Natural Scenes

This project is part of the Algonauts 2023 challenge, which evaluates computational models that predict human brain activity while individuals view images of natural scenes. Understanding how the human brain works is a major challenge for both science and society: with every glance we are exposed to a vast stream of light, yet we perceive the visual world as organized and meaningful. The central goal of this project is to predict human brain responses to complex natural visual scenes, using the most comprehensive brain dataset available for this purpose.

Overview:

*(Figure: project workflow)*

Dataset overview:

  1. Train Dataset: For each of the 8 subjects, there are [9841, 9841, 9082, 8779, 9841, 9082, 9841, 8779] different training images, respectively.
  2. fMRI: The corresponding fMRI visual responses of the left and right hemispheres. The data are z-scored within each NSD scan session and averaged across image repeats, resulting in 2D arrays with images as rows and, as columns, the vertices that showed reliable responses to images during the NSD experiment. The left-hemisphere (LH) and right-hemisphere (RH) files contain 19,004 and 20,544 vertices, respectively.
  3. Test Dataset: For each of the 8 subjects, there are [159, 159, 293, 395, 159, 293, 159, 395] different test images, respectively.
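
Under these conventions, one subject's fMRI data can be sketched as follows (a minimal NumPy example with synthetic values standing in for the challenge's `.npy` arrays; a small image count is used to keep the example light, the vertex counts match the description above):

```python
import numpy as np

# Synthetic stand-in for one subject's arrays: the real files are 2D arrays
# of shape (num_images, num_vertices), z-scored within each scan session.
rng = np.random.default_rng(0)
num_images, lh_vertices, rh_vertices = 64, 19004, 20544
lh_fmri = rng.normal(size=(num_images, lh_vertices)).astype(np.float32)
rh_fmri = rng.normal(size=(num_images, rh_vertices)).astype(np.float32)

# Rows index images (averaged across NSD repeats); columns index the
# reliable cortical vertices.
def zscore(x, axis=0):
    return (x - x.mean(axis=axis, keepdims=True)) / x.std(axis=axis, keepdims=True)

lh_fmri = zscore(lh_fmri)
print(lh_fmri.shape, rh_fmri.shape)
```
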

Region-of-Interest (ROI):
The visual cortex is divided into multiple areas with different functional properties, referred to as regions-of-interest (ROIs).
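
In practice, ROIs let us restrict the response matrix to the vertices of a single visual area. A minimal sketch (the label array and ROI id below are made up for illustration; the challenge ships its own per-hemisphere ROI mapping files):

```python
import numpy as np

# Illustrative ROI labelling: each vertex gets an integer region id.
rng = np.random.default_rng(1)
num_vertices = 19004
roi_labels = rng.integers(0, 8, size=num_vertices)  # 0 = unlabelled, 1-7 = ROIs

fmri = rng.normal(size=(32, num_vertices)).astype(np.float32)

# Keep only the columns (vertices) belonging to one ROI:
mask = roi_labels == 3
roi_responses = fmri[:, mask]
print(roi_responses.shape)
```
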

*(Figure: ROIs of the visual cortex)*

Proposed Methods:

  1. Fine-tune Stable Diffusion
  2. Generative models with a multi-discriminator approach
  3. Minimizing correlation as the GAN loss
  4. Customizing a Patch Discriminator for fMRI data
  5. Vision Transformer as generator

Fine-tune Stable Diffusion:

My initial approach was inspired by a related study (Link). In that study, the authors reconstructed visual images from subjects' fMRI responses: they used a pre-trained diffusion model on specific ROIs of the brain, as described earlier, and conditioned it on the remaining ROIs to infer the subject's thoughts. The following diagrams summarize their methodology and findings:

*(Figures: results and method of the related study)*

I attempted to fine-tune the diffusion model to predict fMRI responses from a given image and subject ID. The initial results were promising: the correlation score between the predicted and actual fMRI data increased with each epoch. Due to computational limitations, however, I was unable to fully explore this approach. The final correlation score achieved with the diffusion model was 0.24.

*(Figure: fine-tuning setup)*

Generative Adversarial Networks (GANs):

One advantage of GANs over diffusion or autoregressive models is their faster training and inference, which allowed me to explore several different approaches.

For brevity, I will only discuss the most promising approaches to fMRI prediction with GAN training:

  1. Multi-discriminator approach: fMRI signals are high-dimensional, even compared with images. To better differentiate generated from actual fMRI data, I hypothesized that training a separate discriminator for each brain region of interest (as described earlier) could enhance the generation process. To test this, two discriminators were used: one for the right-hemisphere fMRI and another for the left hemisphere. This approach led to a notable improvement in the correlation score.
  2. Customizing a Patch Discriminator for fMRI data: I adapted the Patch Discriminator to handle fMRI data, where each input vertex corresponds to a response from a different brain region. In contrast to its use in image analysis, the Patch Discriminator is customized to focus on the local semantics of the fMRI data; to achieve this, the output dimension (HxW) is set equal to the input dimension.
  3. Minimizing correlation as the GAN loss: Since the evaluation metric is the correlation score, I made the Pearson correlation differentiable and used it directly as a loss. Because this loss is neither convex nor concave, training becomes unstable; despite this, the final optimum yielded significantly better results, and the model converged in fewer epochs.
  4. Vision Transformer as a generator: The experiments above used a generator with a very shallow architecture. Replacing it with a ViT (Vision Transformer) architecture, which has been shown to yield impressive results in the literature, led to a final correlation score of 0.54, which was still increasing.
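
The correlation objective from point 3 can be sketched as a differentiable PyTorch loss (a minimal version for illustration, not the exact implementation used in the experiments):

```python
import torch

def pearson_corr_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Negative mean Pearson correlation, computed per vertex across the batch.

    pred, target: (batch, num_vertices). Minimizing this loss maximizes the
    per-vertex correlation between predicted and measured fMRI responses.
    """
    pred = pred - pred.mean(dim=0, keepdim=True)
    target = target - target.mean(dim=0, keepdim=True)
    eps = 1e-8  # guards against division by zero for constant columns
    corr = (pred * target).sum(dim=0) / (pred.norm(dim=0) * target.norm(dim=0) + eps)
    return -corr.mean()
```

In the full objective this term would be weighted and added to the L1 and adversarial losses, e.g. `loss = l1 + adv + w * pearson_corr_loss(pred, target)`.
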
| Model | Loss | Epochs | Pre-trained | Config | More training | Time per epoch | Final Correlation Score (Max: 1) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Dreambooth-Stable-Diffusion | Mean Squared Error | 2 | Yes | Standard | Can improve further | 1 day | 0.24 |
| Vanilla GANs | L1 + GAN loss | 25 | No | Spectral Norm | Cannot be improved | 5 min | 0.15 |
| Vision Transformer GANs | L1 + GAN loss | 50 | Yes | Single ViT discriminator | Can improve further | 2 hrs | 0.54 |
| Vanilla GANs | Correlation + L1 + GAN loss | 6 | No | Two discriminators, Spectral Norm | Cannot be improved further | 8 min | 0.30 |
| U-Net | Correlation + L1 + GAN loss | 1 | No | Two discriminators, Spectral Norm | Need to stabilize correlation loss | 20 min | 0.45 |
| Vanilla GANs | Correlation + L1 + GAN loss | 25 | No | Custom Patch Multi-discriminators, Spectral Norm | Running... | 23 min | __ |
| Vision Transformer GANs | Correlation + L1 + GAN loss | 25 | Yes | Custom Patch Multi-discriminators | Running... | 3 hrs | __ |
| SOTA | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | 0.61 |

The results above suggest that multiple discriminators, a correlation loss, and a Vision Transformer generator can each significantly improve the output. At present, I am running an experiment that combines all three enhancements in a single architecture, along with the customized Patch Multi-discriminators.

Currently, the state-of-the-art (SOTA) for the Algonauts competition achieves a correlation score of 0.61, while my ViT-based approach reached 0.54. With the integration of all three techniques and a few more epochs of training, the results may surpass the SOTA; the outcome should be available in a week.
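
The combined setup can be sketched as one generator step against two per-hemisphere discriminators (a minimal illustration: all module names and layer sizes are made up, and the ViT generator and patch-style discriminators are simplified to plain MLPs):

```python
import torch
import torch.nn as nn

LH, RH = 19004, 20544  # left/right-hemisphere vertex counts

# Toy generator mapping image features to concatenated LH+RH responses.
gen = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, LH + RH))
# One discriminator per hemisphere (the multi-discriminator idea).
disc_lh = nn.Sequential(nn.Linear(LH, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
disc_rh = nn.Sequential(nn.Linear(RH, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

bce = nn.BCEWithLogitsLoss()
l1 = nn.L1Loss()

def generator_step(img_features, lh_real, rh_real):
    fake = gen(img_features)
    lh_fake, rh_fake = fake[:, :LH], fake[:, LH:]
    ones = torch.ones(len(fake), 1)
    # Adversarial terms: fool each hemisphere's discriminator separately.
    adv = bce(disc_lh(lh_fake), ones) + bce(disc_rh(rh_fake), ones)
    # Reconstruction term keeps predictions close to the measured responses;
    # a correlation term would be added here as a third component.
    recon = l1(lh_fake, lh_real) + l1(rh_fake, rh_real)
    return adv + recon

feats = torch.randn(4, 512)
loss = generator_step(feats, torch.randn(4, LH), torch.randn(4, RH))
loss.backward()  # a full loop would also update the discriminators separately
```
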

Result Visualization:

*(Figures: correlation scores visualized on the cortical surface)*

The regions that appear darker indicate a low correlation score.

References:

  1. https://sites.google.com/view/stablediffusion-with-brain/
  2. The Algonauts Project 2023