**Temporally Consistent Video Retargeting without Dependence on Sequential Data**

Student names: Shihao Shen, Abhishek Pavani

(#) MOTIVATION

Robotics has the potential to transform many industries, from manufacturing to healthcare and transportation. However, robot performance in low-light environments is often limited by the lack of available data. Low-light conditions pose a significant challenge for robots, because the sensors and cameras they rely on often fail to capture clear images. Video retargeting is a promising solution to this problem: by generating synthetic images, it can augment existing data and improve robot performance in low-light environments. However, existing image-to-image translation models do not account for temporal consistency, which is crucial for video data. Our project proposes a way to tackle this problem by augmenting existing unpaired data to generate temporally consistent videos.

Poor robot performance in low-light conditions

The video above shows how robots can malfunction in low-light environments: here the robot fails to detect the pedestrian. Our project can help address this problem by augmenting datasets in whatever target domain is required.

(#) PROBLEM STATEMENT

The project's primary objective is to create videos that are temporally consistent, meaning they appear to flow naturally from one frame to the next, even when no sequential data is available. In other words, we aim to generate videos that look realistic and smooth even though they are not produced from a traditional video sequence. Traditionally, video generation techniques rely heavily on sequential data, either a pre-existing video or a series of images captured at different points in time. This project seeks to generate videos without depending on such data, which is particularly useful when sequential data is not readily available or when real-time video generation is required. Methods such as CycleGAN, RecycleGAN, and approaches that leverage optical flow have been applied to video translation, but each has its drawbacks. Our work compares against these methods and shows how our approach addresses their limitations for the task of video retargeting. To study this problem, we translate videos from the VIPER dataset to the CityScapes dataset. Like previous work, ours does not require the data from the two domains to be paired, so it is an unpaired video-to-video translation method.

(#) RELATED WORK

In this section, we show how vanilla CycleGAN and RecycleGAN perform on the VIPER $\longleftrightarrow$ CityScapes task, and how optical-flow-based methods perform on the same dataset.

(##) CycleGAN (Vanilla)

CycleGAN is a type of generative adversarial network (GAN) used for image-to-image translation. It learns a mapping between two image domains without the need for paired examples. The name "CycleGAN" comes from the fact that the network is composed of two cycles: a forward cycle that maps images from one domain to the other, and a backward cycle that maps them back. The architecture consists of two generators and two discriminators. The generators map images from one domain to the other, while the discriminators try to distinguish real from generated images. The key idea behind CycleGAN is a cycle consistency loss that keeps the mapping consistent in both directions: it compares the original image with its reconstruction after being mapped back and forth between the two domains. The architecture is shown below, followed by a minimal sketch of the cycle consistency loss.

CycleGAN architecture
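To make the cycle consistency idea concrete, here is a minimal PyTorch sketch of the loss. It is an illustration rather than code from our repository; `G_XtoY`, `G_YtoX`, `real_x`, and `real_y` are placeholder names for the two generators and a pair of unpaired training images.

```python
import torch.nn as nn

def cycle_consistency_loss(G_XtoY, G_YtoX, real_x, real_y, criterion=nn.L1Loss()):
    """Vanilla CycleGAN cycle loss: mapping an image to the other domain and back
    should reproduce the original image."""
    # Forward cycle: X -> Y -> X.
    rec_x = G_YtoX(G_XtoY(real_x))
    # Backward cycle: Y -> X -> Y.
    rec_y = G_XtoY(G_YtoX(real_y))
    # L1 distance between the originals and their reconstructions.
    return criterion(rec_x, real_x) + criterion(rec_y, real_y)
```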

(##) RecycleGAN

RecycleGAN is an extension of CycleGAN that gained popularity for deep fakes. The architecture takes the previous and current frames of an input video sequence and has two heads. The first head uses a temporal predictor to predict the next frame in the source domain. The second head first translates the previous and current frames from the source domain to the target domain and then predicts the next frame in the target domain; another generator then translates this predicted frame back to the source domain. The recurrent loss and the recycle loss are computed between these frames to maintain cycle consistency over time. This works well for video translation tasks, but the need for sequential data becomes a bottleneck for real-time video translation. A minimal sketch of the recurrent and recycle losses is given after the figure below.

RecycleGAN architecture for video retargeting
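As a rough illustration of the two temporal losses, here is a minimal PyTorch sketch. The interfaces are assumptions rather than the actual RecycleGAN code: `P_X` and `P_Y` stand for the temporal predictors in the source and target domains, `G_XtoY` and `G_YtoX` for the generators, and the predictors are assumed to take the two past frames as separate arguments.

```python
import torch.nn as nn

l1 = nn.L1Loss()

def recurrent_loss(P_X, x_prev, x_curr, x_next):
    """Recurrent loss: the source-domain predictor should forecast the real next frame."""
    return l1(P_X(x_prev, x_curr), x_next)

def recycle_loss(G_XtoY, G_YtoX, P_Y, x_prev, x_curr, x_next):
    """Recycle loss: translate to the target domain, predict the next frame there,
    translate it back, and compare with the real next frame in the source domain."""
    y_prev, y_curr = G_XtoY(x_prev), G_XtoY(x_curr)
    y_next_pred = P_Y(y_prev, y_curr)      # predicted next frame in the target domain
    x_next_rec = G_YtoX(y_next_pred)       # mapped back to the source domain
    return l1(x_next_rec, x_next)
```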

(##) Optical Flow Based Method

This method is another way to perform video translation. We first take the current frame and translate it to the target domain. We also take the next frame and compute an optical flow estimate between the current and the next frame; this flow is then used to warp the translated frame forward to the next timestep. Simultaneously, we translate the next frame to the target domain directly and enforce a consistency loss between the warped frame and the directly translated frame. This method works well, but computing optical flow is an expensive operation and is not suitable for real-time video translation; it also fails in low-light and foggy conditions. A minimal sketch of the warping-based consistency is given after the figure below.

Optical flow based architecture for video retargeting
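Below is a minimal sketch of the warping-based consistency used by this family of methods. It is a generic illustration rather than code from the cited paper: `flow_estimator` is a stand-in for any off-the-shelf optical flow network, and `warp` is a standard bilinear warping helper.

```python
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Warp `frame` (N,C,H,W) with a dense flow field `flow` (N,2,H,W) via bilinear sampling."""
    _, _, h, w = frame.shape
    # Base grid of pixel coordinates (x, y).
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(frame.device)   # (2, H, W)
    coords = base.unsqueeze(0) + flow                               # displaced sampling locations
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    grid_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    grid_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)                    # (N, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)

def flow_consistency_loss(G_XtoY, flow_estimator, x_t, x_t1):
    """Translate frame t, warp it with the estimated flow, and compare against
    the direct translation of frame t+1."""
    flow = flow_estimator(x_t, x_t1)       # dense flow relating the two frames
    warped = warp(G_XtoY(x_t), flow)
    return F.l1_loss(warped, G_XtoY(x_t1))
```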

NOTE: This method also adds the same content loss we used in HW4 and HW5, which extracts features at different abstraction levels with a pre-trained VGG-19 net.

(#) OUR APPROACH

We build on the problems mentioned above and design a model that can perform video translation without the need for sequential data, with an architecture that works well for both images and videos. In our approach, we first take the input frame and translate it to the target domain. We then define a fake optical flow, which is simply a pair of $u$ and $v$ displacement fields randomly sampled from a Gaussian distribution and smoothed with an averaging filter. We use this fake flow to warp the translated image and then translate the result from the target domain back to the source domain. Simultaneously, we warp the source-domain frame with the same fake flow and then translate it to the target domain. We then compute consistency losses between the corresponding frames to maintain temporal consistency. A minimal sketch of the fake-flow generation and warping is given after the architecture figure below.

Our architecture for video retargeting
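To make the idea concrete, here is a minimal sketch of how a fake flow can be synthesized and used. This is a simplified illustration rather than the exact code in our repository: the noise scale and smoothing kernel size are placeholder values, `warp` is the bilinear warping helper from the optical flow sketch above, and the pairing of the two consistency terms follows our reading of the architecture diagram.

```python
import torch
import torch.nn.functional as F

def fake_flow(n, h, w, max_disp=5.0, kernel_size=9, device="cpu"):
    """Sample random (u, v) displacements from a Gaussian, then smooth them
    with an average filter so the motion field is locally coherent."""
    flow = max_disp * torch.randn(n, 2, h, w, device=device)
    kernel = torch.ones(2, 1, kernel_size, kernel_size, device=device) / kernel_size ** 2
    # Depthwise (grouped) convolution applies the same average filter to u and v.
    return F.conv2d(flow, kernel, padding=kernel_size // 2, groups=2)

def temporal_consistency_losses(G_XtoY, G_YtoX, x_t):
    """Two consistency terms built from one real frame x_t and its synthesized 'next frame'."""
    flow = fake_flow(x_t.size(0), x_t.size(2), x_t.size(3), device=x_t.device)
    x_t1_fake = warp(x_t, flow)            # synthesized next frame in the source domain
    y_t = G_XtoY(x_t)                      # translation of the real frame
    y_t1_warped = warp(y_t, flow)          # the translation moved by the same fake flow
    # Path 1: warp the translated frame, map it back, and compare with the fake next frame.
    cons_1 = F.l1_loss(G_YtoX(y_t1_warped), x_t1_fake)
    # Path 2: translate the fake next frame and compare with the warped translation.
    cons_2 = F.l1_loss(G_XtoY(x_t1_fake), y_t1_warped)
    return cons_1, cons_2
```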

The fake flow here works as a data augmentation scheme and helps us generate temporally consistent videos. To better understand why a fake flow works for video retargeting without estimating real motion between consecutive frames, we need to rethink the entire problem:

!!! Tip
    First, we need to acknowledge one thing: if we want to learn temporal consistency for video-to-video translation, then during training we inevitably have to translate at least two images at two consecutive timesteps; otherwise there is no notion of "temporal" during training at all. Beyond that, training is essentially the same across these methods: a series of operations (e.g. translating back to the source domain, as in RecycleGAN) that eventually yields a pair of images in the same domain so that a consistency loss can be computed.

    With this in mind, the question becomes: how should we obtain two frames at two consecutive timesteps? RecycleGAN uses a predictor to predict the frame at the next timestep, while the optical-flow-based method replaces the predictor with an optical flow estimator followed by a warping operation. Both methods assume that they must translate two "real" consecutive frames. That is true for downstream tasks such as motion planning, but not for video-to-video translation, because all we care about is: given two consecutive images in the source domain, translate them to the target domain and check whether they are still spatially and semantically consistent. So why not start from a frame in the dataset and synthesize our own "next frame"? As long as our model can translate both $x_t$ and the fake $x_{t+1}$ into spatially and semantically consistent $y_t$ and $y_{t+1}$, we can be assured that it will also work on real sequential data.

NOTE: The generators and discriminators used for all the architectures mentioned above are the same. Also note that all of the methods, including ours, use the GAN adversarial loss during supervision. Therefore, we combine the adversarial loss, the two consistency losses, the content loss, and the vanilla cycle loss into our final supervision:

\begin{equation}
\begin{split}
\min_G \max_D L(G,D) = L_{gan}(G_X, D_X) + L_{gan}(G_Y, D_Y) \\
+ \lambda_1 L_{cons_1}(G_X) + \lambda_1 L_{cons_1}(G_Y) \\
+ \lambda_2 L_{cons_2}(G_X, G_Y) + \lambda_2 L_{cons_2}(G_Y, G_X) \\
+ \lambda_{cont} L_{cont}(G_X, \text{VGG-19}) + \lambda_{cont} L_{cont}(G_Y, \text{VGG-19}) \\
+ \lambda_{cycle} L_{cycle}(G_X, G_Y) + \lambda_{cycle} L_{cycle}(G_Y, G_X)
\end{split}
\end{equation}

A rough code sketch of how these terms can be combined in one training step is given below.

(#) IMPLEMENTATION

We used two datasets for this task: VIPER and CityScapes, both popular in the self-driving industry for developing and testing autonomous driving technologies. VIPER is a large synthetic dataset collected from Grand Theft Auto V, a modern game that simulates a functioning city and its surroundings in a photorealistic three-dimensional world. We use the same training/testing split as RecycleGAN, which uses VIPER for label-to-image translation. The training set contains about 90,000 images.
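As referenced above, here is a rough sketch of how the supervision terms can be combined in one generator update. It is schematic only: the discriminator update and the VGG-19 content loss are omitted for brevity, a least-squares GAN objective is assumed for the adversarial terms, the $\lambda$ values are placeholders, and `temporal_consistency_losses` is the helper from the fake-flow sketch above.

```python
import torch
import torch.nn.functional as F

# Placeholder loss weights; the real values come from hyperparameter tuning.
LAMBDA_1, LAMBDA_2, LAMBDA_CYCLE = 10.0, 10.0, 10.0

def generator_step(G_XtoY, G_YtoX, D_X, D_Y, x_t, y_t, opt_G):
    """One generator update combining adversarial, temporal consistency, and cycle terms."""
    opt_G.zero_grad()

    fake_y, fake_x = G_XtoY(x_t), G_YtoX(y_t)

    # Adversarial terms: the generators try to make the discriminators output "real".
    pred_fake_y, pred_fake_x = D_Y(fake_y), D_X(fake_x)
    loss_gan = F.mse_loss(pred_fake_y, torch.ones_like(pred_fake_y)) \
             + F.mse_loss(pred_fake_x, torch.ones_like(pred_fake_x))

    # Fake-flow temporal consistency terms in both translation directions.
    cons_x1, cons_x2 = temporal_consistency_losses(G_XtoY, G_YtoX, x_t)
    cons_y1, cons_y2 = temporal_consistency_losses(G_YtoX, G_XtoY, y_t)

    # Vanilla cycle loss in both directions.
    loss_cycle = F.l1_loss(G_YtoX(fake_y), x_t) + F.l1_loss(G_XtoY(fake_x), y_t)

    loss = loss_gan \
         + LAMBDA_1 * (cons_x1 + cons_y1) \
         + LAMBDA_2 * (cons_x2 + cons_y2) \
         + LAMBDA_CYCLE * loss_cycle
    loss.backward()
    opt_G.step()
    return loss.item()
```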

CityScapes is another popular dataset used in the self-driving industry. It is a large-scale dataset of street scenes captured in several German cities, with high-resolution images and detailed annotations of objects such as cars, pedestrians, and traffic signs. We download `leftImg8bit_trainvaltest.zip` and `leftImg8bit_trainextra.zip` for training, which together include about 24,000 images. Since these two packages are not sequential, we additionally download `leftImg8bit_demoVideo.zip` for testing.

We set VIPER as the source domain $X$ and CityScapes as the target domain $Y$. In the following we only show translation results from the source domain to the target domain, because we care about translation quality from simulation to the real world. We train CycleGAN, RecycleGAN, and our method on VIPER-to-CityScapes using the same generator/discriminator architectures and training parameters, with a batch size of 1 on a single NVIDIA Tesla T4 GPU. Note that since RecycleGAN requires sequential input, each of its training images is a horizontal concatenation of the frames at {t-1, t, t+1}, so its training set is one third the size of the one used for CycleGAN and ours. We are not able to train the optical-flow-based method because the authors have not provided an open-source implementation. We train all methods for 2 epochs and test them on selected sequences from VIPER, as shown below.

(#) RESULTS

We have not successfully trained RecycleGAN due to implementation issues. For example, the author provides two types of models: one described in the paper (the one we used to train VIPER-to-CityScapes), and another that adds the vanilla cycle loss from CycleGAN, which the author states performs better in label-to-image experiments on VIPER. We use the default training protocol provided by the author, but given the time constraint and the AWS budget we are not able to re-train it with a different setting. However, compared to CycleGAN, our method significantly improves temporal consistency. Although a qualitative comparison between ours and another video retargeting method is not provided, the most important point we try to convey is that our method needs neither sequential input nor an additional off-the-shelf optical flow estimator. We have shown that using fake optical flow as motion across frames helps the model generalize to real sequential data. Judging by the results reported in their papers, our method generates results at least comparable to RecycleGAN and the optical-flow-based method.

(##) VIPER Sequence 025
Input Video
CycleGAN
RecycleGAN (Not properly trained)
Ours
(##) VIPER Sequence 042
Input Video
CycleGAN
RecycleGAN (Not properly trained)
Ours
(##) More Results from Ours
Input Video (Seq 028)
Ours
Input Video (Seq 066)
Ours
Input Video (Seq 074)
Ours
Input Video (Seq 076)
Ours
As we can see, because the training data are heavily biased toward daytime images (CityScapes has no nighttime data in its training set), performance degrades considerably when our method is applied to a nighttime sequence. In addition, we found that the sim-to-real gap between VIPER and CityScapes is not very large, so the results often amount to little more than a shift in the color tone of the input video. However, generating visually appealing translations between source and target domains is not our goal; instead, we demonstrate that our method can generate temporally consistent results with much less overhead than other video retargeting methods. In fact, it is straightforward to apply our method to other, more meaningful tasks such as day-to-night or label-to-image video retargeting.

(#) CONCLUSION & FUTURE WORK

In summary, we have implemented an architecture that works well for both unpaired image-to-image translation and unpaired video-to-video translation. Most importantly, it does not require sequential input during training to enforce temporal consistency, which means one can train it on any unsorted dataset, test it on a video, and get a temporally consistent result. In addition, we remove the requirement to find an off-the-shelf optical flow estimator and estimate optical flow between consecutive frames, which might not work properly in scenarios such as foggy and low-light environments.

There are still many aspects in which our method can be improved. We should further tune the hyperparameters (the $\lambda$'s in the supervision) to explore the trade-off among realism (controlled by $L_{gan}$), perceptual quality (controlled by $L_{cont}$), and temporal consistency (controlled by $L_{cons}$). We should try other video retargeting tasks. We should also document the VRAM usage and training time of the other baselines to demonstrate that our method has much lower overhead.

(#) REFERENCES

[1] [Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks](https://arxiv.org/abs/1703.10593)
[2] [VIPER Dataset](https://ieeexplore.ieee.org/document/8237505)
[3] [Cityscapes Dataset](https://www.cityscapes-dataset.com/)
[4] [Recycle-GAN: Unsupervised Video Retargeting](https://arxiv.org/pdf/1808.05174.pdf)
[5] [Preserving Semantic and Temporal Consistency for Unpaired Video-to-Video Translation](https://arxiv.org/abs/1908.07683)

Our code is heavily drawn from two sources:

- [pytorch-CycleGAN-and-pix2pix](https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix/tree/master)
- [Recycle-GAN](https://github.com/aayushbansal/Recycle-GAN/tree/master)