**16-726 Final Project: Reproducing the Results of EG3D (Efficient Geometry-Aware 3D GANs)**

Jeff Tan (jefftan@andrew.cmu.edu)

(##) Overview

The goal of this project is to reproduce the results of a recent paper called [EG3D](https://matthew-a-chan.github.io/EG3D/), which shows promising results in the area of efficient geometry-aware 3D GANs. Towards this end, we will discuss three papers in the area of 3D GANs and NeRF: [EG3D](https://arxiv.org/abs/2112.07945), [StyleGAN2](https://arxiv.org/abs/1912.04958), and [instant-ngp](https://arxiv.org/abs/2201.05989), as well as other related works as needed. We will then report our results from attempting to reproduce EG3D, heavily inspired by other concurrent efforts to reimplement the paper.

(##) Summary

Unsupervised generation of high-quality multi-view-consistent images and 3D shapes using only collections of single-view 2D photographs has been a long-standing challenge. Many existing 3D GANs are either not 3D-consistent, which limits shape quality, or compute-intensive, which limits image quality and resolution. EG3D is an innovative architecture for fast and high-quality 3D GANs that leverages existing 2D GAN generators for feature generation, a triplane representation that lifts 2D features into 3D, and NeRF-style neural rendering to produce view-consistent renderings of these 3D triplane features. This architecture is highly expressive because the 2D feature generation component inherits the efficiency and expressiveness of 2D GAN generators such as StyleGAN2, while the explicit 3D intermediate representation coupled with volume rendering enforces multi-view consistency and high-quality geometry. The output of this generation pipeline is passed directly to a StyleGAN2 discriminator and trained in the same way as StyleGAN2.

To improve efficiency on high-resolution outputs, EG3D does not directly render RGB output images using volume rendering. Instead, it uses the previously described generator pipeline to render low-resolution but high-dimensional features corresponding to each latent vector. It then uses a CNN-based super-resolution network to upsample the low-res output to high-res, taking advantage of the computational efficiency and locality of CNNs. As CNN upsampling is not inherently view-consistent, EG3D passes a concatenated low-res and high-res 6-channel input to the discriminator in order to encourage consistency between the low-res and high-res outputs.

(##) Dataset

We use the FFHQ dataset to train EG3D. [FFHQ](https://github.com/NVlabs/ffhq-dataset), or Flickr-Faces-HQ, is a high-quality dataset of human faces originally created as a benchmark for StyleGAN. It consists of 70k PNG images of faces at 1024x1024 resolution, with considerable variation in age, ethnicity, image background, and accessories. An alternative dataset, [AFHQ](https://paperswithcode.com/dataset/afhq) (Animal-Faces-HQ), consists of 15k PNG animal images in three categories (5k each of cat, dog, and miscellaneous wildlife), and was also used in the EG3D paper.

For the purposes of training EG3D, we adopt the same dataset processing strategy as StyleGAN2 and augment the dataset with horizontal flips to increase the dataset size from 70k to 140k. While it is possible to use additional data augmentation techniques for GANs such as differentiable augmentation, we do not do so here.

Importantly, the EG3D generator is conditioned on the camera intrinsics and extrinsics that were used to capture each image in the training set. This is represented as a 25-dimensional conditioning vector: a 3x3 camera matrix for the intrinsics (which we fix to a single default value in our code), and a 4x4 homogeneous transform for the camera pose of each image. Following [existing work](https://github.com/DCGM/ffhq-features-dataset), we use off-the-shelf learning-based face pose estimators (specifically the [Face API](https://azure.microsoft.com/en-us/services/cognitive-services/face/) from Microsoft) to extract the roll, pitch, and yaw of each face in FFHQ. These parameters are stored in a text file and loaded by the dataloader. At runtime, we convert these three Euler angles into a 4x4 homogeneous matrix describing the camera's rotation and translation. Specifically, we assume that a standard forward-facing face is located at $(x,y,z)=(0,0,-1)$ and looking down the $+z$ axis.
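To make the pose-label construction concrete, below is a minimal sketch of how the roll / pitch / yaw angles could be converted into a 4x4 camera pose. The helper name, rotation order, sign conventions, and camera radius are assumptions rather than the exact code, and must be matched to the renderer's coordinate system.

```python
import numpy as np

def euler_to_pose(roll, pitch, yaw, radius=1.0):
    # Hypothetical helper: convert Face API roll/pitch/yaw (degrees) into a
    # 4x4 homogeneous camera pose. Rotation order and sign conventions are
    # assumptions and must match the renderer's coordinate system.
    r, p, y = np.radians([roll, pitch, yaw])
    Rz = np.array([[np.cos(r), -np.sin(r), 0],
                   [np.sin(r),  np.cos(r), 0],
                   [0,          0,         1]])  # roll about the z axis
    Rx = np.array([[1, 0,          0],
                   [0, np.cos(p), -np.sin(p)],
                   [0, np.sin(p),  np.cos(p)]])  # pitch about the x axis
    Ry = np.array([[ np.cos(y), 0, np.sin(y)],
                   [ 0,         1, 0         ],
                   [-np.sin(y), 0, np.cos(y)]])  # yaw about the y axis
    R = Ry @ Rx @ Rz                             # assumed composition order
    pose = np.eye(4)
    pose[:3, :3] = R
    # Place the camera on a sphere of the given radius; the exact translation
    # convention (assumed here) must be consistent with a canonical face at
    # (0, 0, -1) looking down the +z axis.
    pose[:3, 3] = R @ np.array([0.0, 0.0, radius])
    return pose  # flattened to 16 values and concatenated with the 9 intrinsics

# Example: a face turned slightly to its left and tilted up
# pose = euler_to_pose(roll=0.0, pitch=10.0, yaw=-15.0)
```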
Due to the prohibitive training time required to train EG3D from scratch on this dataset (~1 week on 8 Tesla V100 GPUs as reported in the paper), we downsample the dataset using Lanczos resampling and treat 64x64 as our low-resolution output and 256x256 as our high-resolution output, as opposed to 256x256 for low-res and 1024x1024 for high-res as described in the official paper.

(##) Existing approaches for 3D GANs

Many architectures for 3D GANs already exist. [PlatonicGAN](https://arxiv.org/abs/1811.11606) from 2018 uses a textured 3D voxel grid together with a 3D GAN training setup similar to that of EG3D. [Visual Object Networks](https://arxiv.org/abs/1812.02725) from 2018 decomposes the problem of 3D view-consistent generation into separate components for 3D shape, 2.5D silhouette / depth, and surface texture, and optimizes each of these stages separately. [GRAF](https://arxiv.org/abs/2007.02442) from 2020 directly optimizes a NeRF-like implicit radiance field instead of relying on explicit representations like the prior two works. Common to all of these approaches is that 3D structure is baked directly into the generator architecture, sometimes explicitly in the form of a voxel grid or explicit shape / silhouette / depth, and sometimes implicitly in the form of the weights of a radiance field. While this 3D structure is responsible for the view consistency and geometry quality that these 3D GANs achieve, and allows them to accurately render 3D shape and texture from multiple views, these architectures are often slow and difficult to optimize, and lack the efficiency and image quality of 2D GAN architectures.

(##) Method

In this section, we summarize the method described in the [EG3D paper](https://matthew-a-chan.github.io/EG3D/media/eg3d.pdf), as shown in the following figure from the paper. For full details, as well as their experiments and ablation studies, please see the full text of the paper.

(###) Triplane Representation

Training high-resolution 3D GANs requires a 3D representation that is both efficient and expressive. The key innovation of EG3D is a hybrid explicit-implicit triplane representation that achieves both of these advantages. In this formulation, we use three explicit 2D feature maps, each of resolution $N\times N$ with $C$ channels, placed on axis-aligned orthogonal planes. A 3D position $(x,y,z)$ is queried by projecting it onto the three feature planes, retrieving the corresponding feature vectors with bilinear interpolation, and aggregating the results by summation.
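As a concrete illustration, here is a minimal PyTorch sketch of this query procedure. The mapping of world axes onto each plane's two coordinates and the use of `grid_sample` are implementation choices for this sketch, not the paper's exact code, and the points are assumed to be pre-normalized to $[-1,1]$.

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes, xyz):
    """Query triplane features at 3D points.
    planes: (3, C, N, N) feature maps for the xy, yz, and zx planes.
    xyz:    (M, 3) query points, assumed already normalized to [-1, 1].
    Returns an (M, C) tensor of features aggregated by summation."""
    coords = torch.stack([
        xyz[:, [0, 1]],   # project onto the xy plane
        xyz[:, [1, 2]],   # project onto the yz plane
        xyz[:, [2, 0]],   # project onto the zx plane
    ])                                             # (3, M, 2)
    grid = coords.unsqueeze(1)                     # grid_sample expects (B, H, W, 2)
    feats = F.grid_sample(planes, grid, mode='bilinear',
                          padding_mode='zeros', align_corners=False)  # (3, C, 1, M)
    return feats.squeeze(2).sum(dim=0).t()         # sum over planes -> (M, C)
```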
The primary advantage of the triplane representation is efficiency: by shifting most of the expressive power onto explicit features, we can keep the MLP decoder small and reduce the computational cost of rendering compared to fully implicit MLP architectures. This principle of high-quality explicit feature representations for NeRF has been explored in other works such as instant-ngp.

(###) CNN Generator and Volume Rendering

The features of the triplane representation are output by a StyleGAN2 CNN generator: the random latent code and conditioning camera parameters are first processed by a mapping network, as in StyleGAN2, to yield an intermediate latent code, which then modulates the convolution kernels of a synthesis network. Rather than producing RGB outputs, the StyleGAN2 backbone produces an $N_L\times N_L$ feature image whose channels are reshaped into the three axis-aligned feature planes, which are queried during volume rendering using the procedure described above. Using StyleGAN2 as a backbone inherits many of its desirable properties, such as a well-behaved latent space that enables style mixing and latent-space interpolation. Features are sampled from the triplanes, aggregated by summation, and processed by a lightweight MLP decoder following the torch-ngp implementation of NeRF. Volume rendering is implemented using importance sampling, and instead of producing RGB images it produces feature images, which contain more information that can be effectively used for super-resolution and image-space refinement as described in the next section.

(###) Super Resolution and Dual Discrimination

Although the triplane representation is efficient, it is still too slow to naively render at high resolutions such as 512x512 or 1024x1024. Therefore, we perform volume rendering at low resolution and use an image-space CNN to upsample the neural rendering to high resolution. Super-resolution is performed using StyleGAN2-modulated convolutional layers that upsample and refine the volume-rendered feature image into the final RGB image. Here, per-pixel noise inputs are disabled and we reuse the mapping network of the backbone to modulate these layers.

The resulting low-resolution and high-resolution renderings are critiqued by a 2D CNN discriminator, following standard GAN training. For this purpose, we use dual discrimination to avoid multi-view inconsistency between the low-resolution and high-resolution outputs. By upsampling the low-resolution image and concatenating it with the high-resolution image to form a 6-channel image, the discriminator is intuitively able to detect and penalize any inconsistency between the two, which encourages the upsampled outputs to be multi-view consistent. The real images fed into the discriminator are also concatenated with an appropriately blurred copy of the same image. This approach encourages the final output to match the distribution of real images and encourages the neural rendering to match the distribution of downsampled real images.
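A minimal sketch of how the discriminator inputs could be assembled is shown below. The function names are hypothetical, and bilinear interpolation stands in for the resampling and blur choices (our code uses Lanczos resampling for the upsampling step, as described later in the Code Structure section).

```python
import torch
import torch.nn.functional as F

def fake_disc_input(img_hi, img_lo):
    # Hypothetical helper: upsample the raw neural rendering (the RGB channels
    # of the feature image) to the super-resolved size and concatenate
    # channel-wise, producing the 6-channel fake input. Bilinear interpolation
    # stands in for the Lanczos resampling used in our code.
    img_lo_up = F.interpolate(img_lo, size=img_hi.shape[-2:],
                              mode='bilinear', align_corners=False)
    return torch.cat([img_hi, img_lo_up], dim=1)        # (B, 6, N_H, N_H)

def real_disc_input(img_real, low_res):
    # Real images are paired with a blurred copy of themselves, approximated
    # here by downsampling to the neural-rendering resolution and upsampling back.
    blurred = F.interpolate(img_real, size=(low_res, low_res),
                            mode='bilinear', align_corners=False)
    blurred = F.interpolate(blurred, size=img_real.shape[-2:],
                            mode='bilinear', align_corners=False)
    return torch.cat([img_real, blurred], dim=1)        # (B, 6, N_H, N_H)
```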
(###) Pose Conditioning

Most real-world datasets such as FFHQ include biases that correlate camera poses with other attributes like facial expressions, and these unwanted correlations need to be decoupled during inference for multi-view-consistent synthesis. Towards this end, we pass the camera parameters into the backbone mapping network as input to allow the target view to influence synthesis. To prevent the generator from becoming too dependent on the camera poses and rendering a 2D billboard angled towards the camera, we randomly swap the conditioning pose with a random pose with 50% probability during training. To introduce additional information that guides the generator to learn correct 3D poses, we also make the discriminator aware of the camera poses from which the inputs were generated, by passing the rendering camera's intrinsics and extrinsics into the discriminator as a conditioning label.
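A minimal sketch of the 50% pose swapping for generator conditioning is shown below; how the replacement poses are drawn from the dataset's pose distribution is an assumption.

```python
import torch

def swap_conditioning_pose(cond_pose, rand_pose, p_swap=0.5):
    # cond_pose: (B, 25) camera labels matching the rendering camera.
    # rand_pose: (B, 25) camera labels drawn from the dataset's pose
    #            distribution (how these are sampled is an assumption).
    # With probability p_swap, each sample's conditioning pose is replaced by a
    # random pose, which discourages billboard-like, camera-dependent solutions.
    mask = torch.rand(cond_pose.shape[0], 1, device=cond_pose.device) < p_swap
    return torch.where(mask, rand_pose, cond_pose)
```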
(##) Code Structure

The structure of our code largely follows StyleGAN2, and is summarized as follows. The discriminator is a vanilla StyleGAN2 discriminator that accepts 6-channel inputs, formed by concatenating low-res and high-res outputs. The generator consists of the following subnetworks:

- A **mapping network** (from StyleGAN2) to compute latent codes from latent vectors
- A **synthesis network** (from StyleGAN2) to output 2D triplanes
- A **NeRF decoder MLP** to bridge the synthesis network and NeRF network by transforming triplane features into the correct dimensionality
- A **NeRF network** consisting of a volumetric renderer and density / feature MLP output heads. (Here we don't output color alone; rather, we output RGB color alongside higher-dimensional features that are passed to the super-resolution network.)
- A **super-resolution network** to upsample the low-res output

Our code structure is as follows: We read command-line arguments, initialize config dictionaries, construct the relevant neural networks, and set up multi-GPU training. Then, we begin the training loop. On each iteration, we do the following:

- Sample latent vectors, as well as ground-truth images and conditioning vectors from the dataset
- Run the EG3D generator (see the sketch at the end of this section):
    - **Input**: Latent vector $z\in\mathbb R^{512}$ and conditioning vector $c\in\mathbb R^{16+9}$
    - **Output**: Low-resolution image $I_L$ with dimensionality $3\times N_L\times N_L$, high-resolution image $I_H$ with dimensionality $3\times N_H\times N_H$
    - Compute latent codes $w$ from $z$ and $c$ using the vanilla StyleGAN2 mapping network. Partition $w$ into synthesis codes $w_{synth}$ and super-resolution codes $w_{super}$ using the `num_ws` attributes of the synthesis and super-resolution networks.
    - Compute triplane features $feat$ with dimensionality $(3\cdot C')\times N_L\times N_L$ from $w_{synth}$ using the vanilla StyleGAN2 synthesis network
    - Apply the NeRF decoder MLP to each triplane feature vector of dimensionality $C'$ to transform it to dimensionality $C$ as required by NeRF
    - Reshape the triplane features into $F_{xy}$, $F_{yz}$, and $F_{zx}$ triplanes, each with dimensionality $C\times N_L\times N_L$
    - Render the raw low-resolution output features $F_L$ with dimensionality $C\times N_L\times N_L$ by sampling rays and performing NeRF volume rendering. Each $(x,y,z)$ point sampled along a ray is used to bilinearly sample from $F_{xy}$, $F_{yz}$, and $F_{zx}$ using the corresponding pair of 2D coordinates, in order to extract $xy$, $yz$, and $zx$ features, each of dimensionality $C$, which are then aggregated.
    - Interpret the first three channels of $F_L$ as the low-resolution RGB output $I_L$ with dimensionality $3\times N_L\times N_L$.
    - Compute the high-resolution RGB output $I_H$ with dimensionality $3\times N_H\times N_H$ from the entire low-resolution features $F_L$ and the super-resolution codes $w_{super}$
- Run the EG3D discriminator:
    - **Input**: Low-resolution image $I_L$ with dimensionality $3\times N_L\times N_L$, high-resolution image $I_H$ with dimensionality $3\times N_H\times N_H$, and conditioning vector $c\in\mathbb R^{16+9}$
    - **Output**: A real / fake prediction
    - Compute the upsampled low-resolution input $I_{L,up}$ with dimensionality $3\times N_H\times N_H$ using Lanczos resampling
    - Concatenate $I_{L,up}$ and $I_H$ into a 6-channel image $I$ with dimensionality $6\times N_H\times N_H$ and pass the result into the vanilla StyleGAN2 discriminator
- Compute the loss function and run optimization as in standard GAN training.

As our code and training loop are derived from StyleGAN2, we inherit many of its innovations, such as redesigned generator normalization, alternatives to progressive growing, and path length regularization. EG3D's architecture allows it to benefit from advances in 2D GANs, since it can use any 2D GAN architecture as a backbone. In future work, it would be interesting to investigate EG3D's performance with additional 2D GAN backbones such as StyleGAN3. The super-resolution network can be constructed with small changes to the vanilla StyleGAN2 synthesis network, as it simply passes the raw low-resolution output through additional synthesis layers. Our implementation of NeRF volume rendering is derived from the [torch-ngp](https://github.com/ashawkey/torch-ngp) repository on GitHub, but due to engineering challenges we currently use their vanilla PyTorch implementation without CUDA-based raymarching or fully-fused MLPs, which could provide additional speedup. While StyleGAN2 and torch-ngp both use FP16 to speed up computation, we use FP32 for all calculations to ensure numerical stability.
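To tie the walkthrough above together, here is a condensed sketch of the generator forward pass. The submodule names, constructor, and tensor bookkeeping are placeholders rather than the exact classes in our codebase, and the NeRF decoder is assumed to be applied as a pointwise (1x1-convolution-style) MLP over the plane features.

```python
import torch
import torch.nn as nn

class EG3DGeneratorSketch(nn.Module):
    # Condensed sketch of the generator forward pass described above.
    # Submodules are passed in; their names and interfaces are placeholders.
    def __init__(self, mapping, synthesis, decoder, renderer, superres):
        super().__init__()
        self.mapping = mapping      # StyleGAN2 mapping network: (z, c) -> w
        self.synthesis = synthesis  # StyleGAN2 synthesis network: w -> raw triplane features
        self.decoder = decoder      # pointwise MLP: C' channels -> C channels
        self.renderer = renderer    # NeRF volume renderer over the triplanes
        self.superres = superres    # StyleGAN2-style super-resolution layers

    def forward(self, z, c):
        w = self.mapping(z, c)                               # (B, num_ws, 512)
        n_synth = self.synthesis.num_ws
        w_synth, w_super = w[:, :n_synth], w[:, n_synth:]
        feat = self.synthesis(w_synth)                       # (B, 3*C', N_L, N_L)
        B, _, N, _ = feat.shape
        planes = self.decoder(feat.view(B * 3, -1, N, N))    # C' -> C per plane
        planes = planes.view(B, 3, -1, N, N)                 # F_xy, F_yz, F_zx
        F_L = self.renderer(planes, c)                       # feature image (B, C, N_L, N_L)
        I_L = F_L[:, :3]                                     # first 3 channels = low-res RGB
        I_H = self.superres(F_L, w_super)                    # (B, 3, N_H, N_H)
        return I_L, I_H
```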
(##) Results

Our implementation is able to perform reasonably high-quality and multi-view-consistent synthesis at low and high resolutions. Shown below are upsampled outputs at 256x256 resolution:

In these examples, the face reconstruction seems to be reasonably consistent even across multiple views. There is, however, some poor behavior with conflicting backgrounds, unusual hair geometry or accessories, and view-dependent facial expressions. For example, in the top left of the third output image the man's mouth always seems to face forward no matter the camera orientation, and in the bottom right of the first output image the woman's teeth seem to change appearance as the camera pose rotates. Many of these error cases can be attributed to view inconsistency introduced by the super-resolution network, or to overreliance on pose conditioning in the generator.

There are also some other poor behaviors that can be attributed to poor foreground / background separation. In the top left of the second output image, the pink background is treated as part of the woman's hair. In the top right of the second output image, the pink and green backgrounds also clash in a strange way, and it is unclear where the woman's hat ends and where the environment begins. Finally, in the bottom left of the third output image, the girl's hands are poorly modeled and split up across multiple depths, showing that our model has a poor understanding of hands and other geometries that appear rarely in the training set.

Due to time constraints and long computation times, I was not able to explore further ablation studies or architectural changes on top of what EG3D proposed. However, exploring different resolutions, conditioning strategies, GAN backbones, training / warmup procedures, and speedup techniques would be interesting directions for future work.

As an aside, the triplane representation proposed by EG3D can also serve as an explicit positional encoding to help speed up NeRF. In this setting of NeRF single-scene optimization, the triplanes are parameters that are optimized directly using a reconstruction loss on the NeRF output with respect to ground-truth multi-view images of a scene. Shown below are some of my results from single-scene optimization of Triplane NeRF compared to instant-ngp, on the Lego scene from the NeRF Synthetic dataset, both after 10 epochs of optimization (about 5-10 minutes for Triplane NeRF). These results show that Triplane NeRF is able to achieve comparable quality after just a few minutes of optimization, and has the potential to serve as an alternative explicit positional encoding for NeRF in other settings:
(Each row above: instant-ngp NeRF on the left, Triplane NeRF on the right.)
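For reference, a minimal sketch of this single-scene setting is shown below: the triplanes are free parameters trained with a photometric loss against ground-truth views. The `render_rays` volume-rendering helper, the ray batching, and the module layout are assumptions (e.g. adapted from torch-ngp's vanilla PyTorch path) and are not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriplaneField(nn.Module):
    # Triplane NeRF for single-scene optimization: the planes themselves are
    # learnable parameters, decoded by a small MLP into density and color.
    def __init__(self, resolution=256, channels=32):
        super().__init__()
        self.planes = nn.Parameter(0.1 * torch.randn(3, channels, resolution, resolution))
        self.mlp = nn.Sequential(nn.Linear(channels, 64), nn.ReLU(),
                                 nn.Linear(64, 4))           # sigma + RGB

def train_step(field, render_rays, rays_o, rays_d, target_rgb, optimizer):
    # render_rays(field, rays_o, rays_d) -> (num_rays, 3) is an assumed volume
    # rendering helper that samples the triplanes and decodes with field.mlp.
    pred_rgb = render_rays(field, rays_o, rays_d)
    loss = F.mse_loss(pred_rgb, target_rgb)                  # photometric reconstruction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```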
(##) Attributions

This project primarily refers to the following three papers:

- Chan, Eric R. and Lin, Connor Z. and Chan, Matthew A. and Nagano, Koki and Pan, Boxiao and De Mello, Shalini and Gallo, Orazio and Guibas, Leonidas and Tremblay, Jonathan and Khamis, Sameh and Karras, Tero and Wetzstein, Gordon. "Efficient Geometry-Aware 3D Generative Adversarial Networks." December 2021, https://arxiv.org/abs/2112.07945
- Karras, Tero and Laine, Samuli and Aittala, Miika and Hellsten, Janne and Lehtinen, Jaakko and Aila, Timo. "Analyzing and Improving the Image Quality of StyleGAN." December 2019, https://arxiv.org/abs/1912.04958
- Mildenhall, Ben and Srinivasan, Pratul P. and Tancik, Matthew and Barron, Jonathan T. and Ramamoorthi, Ravi and Ng, Ren. "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis." March 2020, https://arxiv.org/abs/2003.08934

My codebase for this project draws heavily from three sources:

- The official StyleGAN2 implementation in PyTorch, for overall code structure and network components: https://github.com/NVlabs/stylegan2-ada-pytorch
- An unofficial re-implementation of instant-ngp by ashawkey, for NeRF volume rendering: https://github.com/ashawkey/torch-ngp
- An unofficial re-implementation of EG3D by shoutOutYangJie, for EG3D-specific code structure and details: https://github.com/shoutOutYangJie/EG3D-pytorch

This report was generated with Markdeep using the report template from 15-468/668/868 Physics Based Rendering.