Deep Fakes: A Bird’s-Eye View

This article is based on the survey by Tolosana et al. in this paper. My goal is to give a high-level overview of the current state of affairs in the world of DeepFakes and hopefully inspire you to advance the state-of-the-art techniques mentioned here :)

Deep fakes refer to the technique of creating fake videos using deep learning; most popularly, this involves swapping the face of a person in a video with the face of another person. These techniques produce realistic AI-generated videos of people doing and saying fictional things and, as such, have the potential to significantly affect how people determine the legitimacy of information presented online. Moreover, recent advances in autoencoders and Generative Adversarial Networks (GANs) have made it increasingly easy for anyone, even without experience in the field, to generate fake images and videos. Apps such as FaceApp and ZAO, for instance, are publicly available.

Source: https://www.faceapp.com/

Within the field, four main directions of research have emerged:

  • Face Synthesis
  • Identity Swap
  • Attribute Manipulation
  • Expression Swap

We’ll define and explore each of these areas as well as provide some insight into state-of-the-art models and techniques applied in each of these. Resources will be linked in the references for an in-depth study into the algorithms involved. We’ll also discuss open areas of research that can be considered to advance these fields.

We’ll also explore some of the nascent fields such as face morphing and face de-identification and offer a similar though less detailed analysis on these areas.

Before we dive into each of these, some housekeeping is in order so that we’re on the same page as we explore these areas.

  • For a primer on GANs, check out this article by Jason Brownlee.

TLDR; GANs are a model architecture for training a generative model. The GAN model architecture involves two sub-models: a generator model for generating new examples and a discriminator model for classifying whether generated examples are real, from the domain, or fake, generated by the generator model. The two models, the generator and discriminator, are trained together. The generator generates a batch of samples, and these, along with real examples from the domain, are provided to the discriminator and classified as real or fake. The discriminator is then updated to get better at discriminating real and fake samples in the next round, and importantly, the generator is updated based on how well, or not, the generated samples fooled the discriminator.
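To make this concrete, here is a minimal, self-contained PyTorch sketch of that training loop. It uses tiny fully-connected networks on toy vectors rather than the convolutional generators and discriminators used for faces; the architectures, sizes, and learning rates are illustrative only.

```python
import torch
import torch.nn as nn

# Toy generator and discriminator for 64-dim "images"; real GANs use conv nets.
latent_dim, data_dim = 16, 64
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_batch):
    batch = real_batch.size(0)
    # 1) Update the discriminator: real samples -> label 1, generated -> label 0.
    z = torch.randn(batch, latent_dim)
    fake_batch = G(z).detach()
    d_loss = bce(D(real_batch), torch.ones(batch, 1)) + \
             bce(D(fake_batch), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Update the generator: it is rewarded when D classifies its samples as real.
    z = torch.randn(batch, latent_dim)
    g_loss = bce(D(G(z)), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

# Example: pretend the "real" data is Gaussian noise, just to exercise the loop.
print(train_step(torch.randn(32, data_dim)))
```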

  • Check out this Quora post to understand the EER metric that we’ll use to evaluate different models.

TLDR; EER (Equal Error Rate) is the operating point at which the false acceptance rate and the false rejection rate are equal. The lower your EER, the better your system.
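If you want to compute the EER yourself, the sketch below shows one common way to approximate it from an ROC curve using scikit-learn; the scores and labels here are made up purely for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve

# scores: higher means "predicted genuine"; labels: 1 = genuine, 0 = impostor.
labels = np.array([1, 1, 1, 0, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.4, 0.35, 0.2, 0.6, 0.7, 0.1])

fpr, tpr, thresholds = roc_curve(labels, scores)
fnr = 1 - tpr
# The EER is the error rate at the threshold where FPR and FNR cross.
idx = np.nanargmin(np.abs(fpr - fnr))
eer = (fpr[idx] + fnr[idx]) / 2
print(f"EER ~ {eer:.3f} at threshold {thresholds[idx]:.2f}")
```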

Face Synthesis

This is the creation of synthetic face images, mostly through GAN techniques. The two most popular architectures here are ProGAN and StyleGAN.

Source: https://arxiv.org/pdf/2001.00179v3.pdf

ProGAN (Progressive Growing of GANs)

The key idea is to grow both the generator and discriminator progressively, starting from a low resolution and then adding layers that model increasingly fine details as the training progresses.

The following steps were followed:

  • Artificially shrink images to a very small resolution, say only 4 by 4 pixels.
  • Create a generator with only a few layers to synthesize images at this low resolution, and a corresponding discriminator with a mirrored architecture. Train this model; this should be relatively quick at the start since the images have a very low resolution.
  • Add another layer and double the output resolution of the generated image.
  • Keep the weights from the earlier layers but don't freeze them, and retrain the network. This new layer should be gradually faded in (see the sketch after the source link below), and eventually the GAN will output convincing images at the new 8 by 8 resolution.
  • Repeat the steps above until the output is of the desired resolution.

Source: https://arxiv.org/abs/1710.10196
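Here is a rough PyTorch sketch of the fade-in step referenced above: while a new resolution is being grown, the output of the new block is blended with the upsampled output of the previous resolution using a ramping alpha. The layer sizes and block structure are simplified stand-ins, not the ProGAN reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FadeInBlock(nn.Module):
    """Blends the new high-resolution branch with the upsampled old output
    while alpha ramps from 0 to 1, as in progressive growing."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.new_block = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.2),
        )
        self.new_to_rgb = nn.Conv2d(out_ch, 3, 1)   # RGB head of the new resolution
        self.old_to_rgb = nn.Conv2d(in_ch, 3, 1)    # RGB head of the previous resolution

    def forward(self, features, alpha):
        old_rgb = F.interpolate(self.old_to_rgb(features), scale_factor=2)
        new_rgb = self.new_to_rgb(self.new_block(features))
        return (1 - alpha) * old_rgb + alpha * new_rgb  # gradual fade-in

# 4x4 feature maps being grown to 8x8; alpha would be increased each training step.
block = FadeInBlock(in_ch=64, out_ch=64)
out = block(torch.randn(1, 64, 4, 4), alpha=0.3)
print(out.shape)  # torch.Size([1, 3, 8, 8])
```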

StyleGAN

This was motivated by the style transfer literature. The generator starts from a learned constant input and adjusts the "style" of the image at each convolution layer based on the latent encoding, thereby directly controlling the strength of the image features at different scales. Combined with noise injected directly into the network, this architectural change leads to automatic, unsupervised separation of high-level attributes (e.g., pose, identity) from stochastic variation (e.g., freckles, hair) in the generated images, and enables intuitive scale-specific mixing and interpolation operations.

Source: https://arxiv.org/pdf/1812.04948.pdf

While a traditional generator feeds the latent code through the input layer only, StyleGAN first maps the input to an intermediate latent space W, which then controls the generator through adaptive instance normalization (AdaIN) at each convolution layer. Gaussian noise is added after each convolution, before evaluating the nonlinearity. Here "A" stands for a learned affine transform, and "B" applies learned per-channel scaling factors to the noise input. The output of the last layer is then converted to RGB using a separate 1 × 1 convolution.
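Below is a hedged PyTorch sketch of the AdaIN operation and per-channel noise injection described above. The layer names and dimensions are illustrative; the official StyleGAN implementation differs in many details.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive instance normalization: normalizes each feature map, then applies
    a per-channel scale and bias predicted from the latent style vector w."""
    def __init__(self, style_dim, num_channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        self.affine = nn.Linear(style_dim, num_channels * 2)  # the learned "A"

    def forward(self, x, w):
        scale, bias = self.affine(w).chunk(2, dim=1)
        scale = scale[:, :, None, None]
        bias = bias[:, :, None, None]
        return (1 + scale) * self.norm(x) + bias

class NoiseInjection(nn.Module):
    """Adds noise scaled by learned per-channel factors (the learned "B")."""
    def __init__(self, num_channels):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(1, num_channels, 1, 1))

    def forward(self, x):
        return x + self.weight * torch.randn(x.size(0), 1, x.size(2), x.size(3))

x = torch.randn(2, 64, 16, 16)   # feature maps at one generator scale
w = torch.randn(2, 512)          # intermediate latent code from the mapping network
x = NoiseInjection(64)(x)
x = AdaIN(style_dim=512, num_channels=64)(x, w)
print(x.shape)
```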

Datasets

Popular publicly available databases include:

  • 100K-Generated-Images — contains 100,000 synthetic images generated using the StyleGAN architecture trained on the FFHQ dataset.
  • 100K-Faces — contains 100,000 synthetic images trained using 29,000 photos of 69 different models. All the images in the training set were taken in controlled settings, i.e., with flat backgrounds, and therefore the synthetic images produced don't contain any strange artifacts.
  • DFFD (Diverse Fake Face Dataset) — 100,000 and 200,000 fake images generated using ProGAN and StyleGAN respectively.
  • iFakeFaceDB — 250,000 and 80,000 synthetic images generated using StyleGAN and ProGAN respectively.

Of the above datasets, iFakeFaceDB has become the unofficial gold standard for face synthesis detection because its creators used GANprintR to remove the fingerprints produced by GAN architectures. Fake images generated through GANs are typically characterized by a specific GAN fingerprint, just as natural images are characterized by a device-based fingerprint (i.e., the PRNU). In fact, this fingerprint is specific not only to the GAN architecture but also to the particular trained instance of that architecture.

iFakeFaceDB images have the GAN fingerprint removed while maintaining a very realistic appearance.

Source: https://arxiv.org/pdf/1911.05351.pdf

In the GANprintR architecture above, an autoencoder is trained using only real face images from a development dataset. In the evaluation phase, fake images are then passed through the same network, and the AE thus acts as a non-linear low-pass filter that removes GAN fingerprints.
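As a rough illustration of that idea, here is a minimal convolutional autoencoder sketch in PyTorch: it is trained to reconstruct real faces only, and at evaluation time fake images are simply passed through it. This is in the spirit of GANprintR, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

# A shallow convolutional autoencoder trained only on real faces; at test time,
# passing a GAN image through it acts like a non-linear low-pass filter that
# suppresses the GAN fingerprint.
class FingerprintRemover(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = FingerprintRemover()
real_faces = torch.rand(8, 3, 64, 64)      # training uses real images only
loss = nn.functional.mse_loss(model(real_faces), real_faces)
fake_face = torch.rand(1, 3, 64, 64)       # at evaluation time, a GAN-generated image
cleaned = model(fake_face)                 # fingerprint largely filtered out
print(loss.item(), cleaned.shape)
```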

It is therefore not surprising that iFakeFaceDB images present a harder challenge for fake detectors compared to other databases, as evidenced by the comparison below.

Source: https://arxiv.org/pdf/2001.00179v3.pdf

The best performing models at the time of writing are by Dang et al. and Neves et al. We'll briefly explore the architecture used by Dang et al.

Their architecture as described in this paper is as follows.

Source: https://arxiv.org/pdf/1911.05351.pdf

Given any backbone network such as XceptionNet or VGG16, an attention-based layer can be inserted into the network. It takes the high-dimensional feature F as input, estimates an attention map M_att, and channel-wise multiplies it with the high-dimensional features from the convolutional neural net, which are fed back into the backbone. In addition to the binary classification supervision loss L_classifier, either a supervised or weakly supervised loss, L_map, can be applied to estimate the attention map, depending on whether the ground truth manipulation map M_gt is available.
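A minimal PyTorch sketch of such an attention layer is shown below, assuming a single 1×1 convolution to regress the attention map M_att; the real model's backbone, map-estimation variants, and losses (L_classifier and L_map) are more elaborate.

```python
import torch
import torch.nn as nn

class AttentionLayer(nn.Module):
    """Estimates a single-channel attention map from the backbone features and
    multiplies it back into every channel (a simplified stand-in for the
    attention-based layer of Dang et al.)."""
    def __init__(self, in_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, features):
        att_map = torch.sigmoid(self.conv(features))   # M_att, values in [0, 1]
        refined = features * att_map                    # channel-wise multiplication
        return refined, att_map

backbone_features = torch.randn(4, 728, 19, 19)   # e.g. mid-level XceptionNet features
refined, att_map = AttentionLayer(728)(backbone_features)
# `refined` is fed back into the rest of the backbone; `att_map` can be supervised
# with a ground-truth manipulation mask M_gt (L_map) when it is available.
print(refined.shape, att_map.shape)
```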

Assuming the attention map can highlight the manipulated image regions, it guides the network to detect these regions; this alone should be useful for face forgery detection. In fact, each pixel in the attention map should compute the probability that its receptive field corresponds to a manipulated region in the input image. Digital forensics has shown that camera model identification is possible due to "fingerprints" in the high-frequency information of a real image. It is thus feasible to detect abnormalities in this high-frequency information due to algorithmic processing.

Hence the attention map is inserted into the backbone network where the receptive field corresponds to appropriately sized local patches. Then, the features before the attention map encode the high-frequency fingerprint of the corresponding patch, which helps the model discriminate between real and manipulated regions at the local level.

It is important to mention that there are also databases of real images that are necessary for training fake detectors. These include FFHQ, CelebA, and VGGFace among others.

Identity Swap

This is probably the most popular area of research around deep fakes. It is the replacement of the face of one person in a video with the face of another person. The most popular algorithms are FakeApp and FaceSwap-GAN.

The FakeApp algorithm is not publicly available, but the FaceSwap-GAN algorithm was modeled after CycleGAN.

CycleGAN

The high-level idea in CycleGANs is to learn to translate an image from a source domain X to a target domain Y in the absence of paired examples, i.e., to learn a mapping G: X → Y such that the distribution of images from G(X) is indistinguishable from the distribution of Y, using an adversarial loss. This objective is then coupled with a cycle consistency loss for the inverse mapping F: Y → X, such that F(G(X)) ≈ X and vice versa. Together, these yield the full CycleGAN objective for unpaired image-to-image translation.

Source: https://arxiv.org/pdf/1703.10593.pdf

The model as seen above contains two mapping functions G: X → Y and F: Y → X, and associated adversarial discriminators D_Y and D_X. D_Y encourages G to translate X into outputs indistinguishable from domain Y, and vice versa for D_X and F. To further regularize the mappings, the authors introduced two cycle-consistency losses that capture the intuition that if we translate from one domain to the other and back again, we should arrive where we started: (b) the forward cycle-consistency loss, x → G(x) → F(G(x)) ≈ x, and (c) the backward cycle-consistency loss, y → F(y) → G(F(y)) ≈ y.
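A minimal sketch of the cycle-consistency term, assuming an L1 distance and stand-in one-layer "generators" just to show the call, looks like this:

```python
import torch
import torch.nn as nn

l1 = nn.L1Loss()

def cycle_consistency_loss(G, F, x, y, lam=10.0):
    """L_cyc = ||F(G(x)) - x||_1 + ||G(F(y)) - y||_1, weighted by lambda.
    G: X -> Y and F: Y -> X are the two generators of the CycleGAN."""
    forward_cycle = l1(F(G(x)), x)    # x -> G(x) -> F(G(x)) ~ x
    backward_cycle = l1(G(F(y)), y)   # y -> F(y) -> G(F(y)) ~ y
    return lam * (forward_cycle + backward_cycle)

# Stand-in "generators"; real ones are convolutional encoder-decoders.
G = nn.Conv2d(3, 3, 1)
F_ = nn.Conv2d(3, 3, 1)
x, y = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
print(cycle_consistency_loss(G, F_, x, y).item())
```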

faceswap-GAN

FaceSwap-GAN builds on this idea with the architecture shown below.

Source: https://github.com/shaoanlu/faceswap-GAN

During the training phase, you have both sets of real faces and warped faces, and you use the adversarial and cycle consistency losses to help the AE learn how to generate the real face from a warped version.
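As a rough illustration of this training setup, here is a toy PyTorch sketch of one training step combining an L1 reconstruction loss with an adversarial term; the networks, loss weights, and the full set of losses used in the faceswap-GAN repository are stand-ins, not the actual implementation.

```python
import torch
import torch.nn as nn

# Stand-in networks: a shared encoder, a decoder for identity A, and a discriminator.
encoder = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
decoder_A = nn.Sequential(nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid())
disc_A = nn.Sequential(nn.Conv2d(3, 1, 4, stride=4), nn.Flatten(), nn.Linear(16 * 16, 1))

bce = nn.BCEWithLogitsLoss()
warped, target = torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64)

# The autoencoder must recover the original (unwarped) face from its warped version,
# while the discriminator pushes the output towards the real-face distribution.
reconstructed = decoder_A(encoder(warped))
recon_loss = nn.functional.l1_loss(reconstructed, target)    # pixel reconstruction
adv_loss = bce(disc_A(reconstructed), torch.ones(4, 1))      # fool the discriminator
total_loss = recon_loss + 0.1 * adv_loss
print(total_loss.item())
```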

Source: https://github.com/shaoanlu/faceswap-GAN

Datasets

The datasets are split into 1st and 2nd generation databases, with the latter having significantly higher realism and fewer visible artifacts.

Popular 1st gen face-swapping databases include:

  • DeepFake — 620 fake videos (made using CycleGAN) of 32 subjects from the VidTIMIT database. Multi-Task Cascaded Convolutional Networks (MTCNN) were used for more stable face detection and reliable face alignment, and Kalman filters were used to smooth the bounding-box position over frames and eliminate jitter on the swapped face.
  • FaceForensics++ — 1,000 fake videos generated from real YouTube videos, made using a combination of the FaceSwap algorithm and DeepFake techniques.

Popular 2nd gen face-swapping databases include:

  • DeepFakeDetection — made using DeepFaceLab
  • DFDC — private dataset made through a collaboration between Facebook, Microsoft, and Amazon.

The difference between 1st generation and 2nd generation databases can be seen below.

Source: https://arxiv.org/pdf/2001.00179v3.pdf

In terms of manipulation detection, Tolosana et al. (2020) developed one of the state-of-the-art models for detecting face swaps in the 2nd generation databases (there are already plenty of good algorithms that achieve state-of-the-art results on the 1st generation databases). The highlight of their approach is that, rather than feeding the entire face to the detection algorithm, they pass in specific facial regions such as the eyes, nose, mouth, etc. These facial regions are segmented using the 68 pose-invariant landmarks extracted with OpenFace2, an open-source toolbox, and the remaining parts of the face are discarded. Each facial region is then fed to a fake detector, primarily Xception or Capsule Network models.
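A hedged sketch of the region-cropping step is shown below, assuming landmarks in the common 68-point layout (e.g., as produced by OpenFace 2.0 or dlib); the exact segmentation used by Tolosana et al. may differ.

```python
import numpy as np

# Index ranges follow the common 68-point facial landmark annotation scheme.
REGIONS = {
    "eyes":  list(range(36, 48)),
    "nose":  list(range(27, 36)),
    "mouth": list(range(48, 68)),
}

def crop_region(image, landmarks, region, margin=10):
    """image: HxWx3 array, landmarks: 68x2 array of (x, y) points."""
    pts = landmarks[REGIONS[region]]
    x0, y0 = pts.min(axis=0).astype(int) - margin
    x1, y1 = pts.max(axis=0).astype(int) + margin
    h, w = image.shape[:2]
    return image[max(y0, 0):min(y1, h), max(x0, 0):min(x1, w)]

# Dummy data just to show the call; real landmarks come from the landmark detector.
image = np.zeros((256, 256, 3), dtype=np.uint8)
landmarks = np.random.randint(50, 200, size=(68, 2))
mouth_crop = crop_region(image, landmarks, "mouth")
print(mouth_crop.shape)   # the crop is fed to the region-specific fake detector
```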

Source: https://arxiv.org/pdf/2004.07532.pdf

There is still a lot of room for improvement, especially on the 2nd generation databases, which consistently show high EER values even for sophisticated algorithms such as the one above by Tolosana et al.

Attribute Manipulation

This is the modification of some attributes of the face in an image, such as the color of the hair or the skin, gender, age, etc.

Source: https://arxiv.org/pdf/1904.09709.pdf

Selective Transfer GANs (STGANs) are currently the state of the art for manipulating facial attributes, as can be seen from the samples above.

STGAN

In general, attribute manipulation can be tackled with an encoder-decoder or a GAN; however, the bottleneck layer in the encoder-decoder usually produces blurry, low-quality manipulation results. STGAN incorporates selective transfer units into the encoder-decoder to adaptively select and modify encoder features for enhanced attribute editing.

The model thus only considers the attributes to be changed and, for attribute-irrelevant regions, selectively concatenates the encoder features with the decoder features. In terms of transfer, the model adaptively modifies the encoder features to match the requirements of the varying editing tasks, thereby providing a unified model for handling both local and global attributes.

Therefore, instead of the full target attribute vector, STGAN takes the difference between the target and source attribute vectors as input to the encoder-decoder. Selective Transfer Units (STUs) are added to each pair of encoder and decoder layers; they take the encoder feature, the inner state, and the difference attribute vector into consideration to exploit cross-layer consistency and task specificity.
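As a tiny illustration of the difference attribute vector, here is a sketch assuming a hypothetical attribute ordering (STGAN's real attribute list follows the CelebA attributes):

```python
import numpy as np

# Hypothetical attribute order, for illustration only.
ATTRS = ["Bald", "Bangs", "Eyeglasses", "Male", "Mouth_Open", "Young"]

source = np.array([0, 1, 0, 1, 1, 1])   # attributes of the input image
target = np.array([0, 1, 1, 1, 0, 1])   # desired attributes after editing

# STGAN conditions the encoder-decoder on the *difference* vector rather than
# the full target vector: 0 means "leave alone", +1 add, -1 remove.
diff = target - source
print(dict(zip(ATTRS, diff)))   # Eyeglasses: +1 (add), Mouth_Open: -1 (remove), rest 0
```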

Source: https://arxiv.org/pdf/1904.09709.pdf

Thus we have the overall structure of STGAN above. On the left is the generator. The top-right figure shows the detailed STU structure, and all variables marked in this figure share the same dimension. The difference attribute vector for adding the Eyeglasses attribute and removing the Mouth Open attribute is shown on the bottom right.

Datasets

The DFFD is the gold standard for attribute manipulation. It contains 18,416 and 79,960 images generated through the FaceApp and StarGAN approaches respectively.

The model developed by Dang et al. (mentioned in the Face Synthesis section above) combines an attention mechanism with a ConvNet backbone to achieve state-of-the-art detection results, as can be seen below.

Source: https://arxiv.org/pdf/2001.00179v3.pdf

Expression Swap

This is the modification of the facial expression of the person in a video. Popular approaches include Face2Face and NeuralTextures; we'll explore NeuralTextures.

Neural Textures

These are learned feature maps that are trained as part of the scene capture process. Like traditional textures, neural textures are stored as maps on top of 3D mesh proxies; however, the high-dimensional feature maps contain significantly more information, which can be interpreted by a deferred neural rendering pipeline. Both the neural textures and the deferred neural renderer are trained end-to-end, synthesizing photo-realistic images even when the original 3D content was imperfect.

In contrast to traditional, black-box 2D generative neural networks, this 3D representation gives the model explicit control over the generated output and allows for a wide range of application domains. For instance, it can synthesize temporally consistent video re-renderings of recorded 3D scenes as the representation is inherently embedded in 3D space. In this way, neural textures can be utilized to coherently re-render or manipulate existing video content in both static and dynamic environments at real-time rates.

Source: https://arxiv.org/pdf/1904.12356.pdf

Given an object with a valid UV-map parameterization and an associated Neural Texture map as input, the standard graphics pipeline is used to render a view-dependent screen-space feature map. The screen space feature map is then converted to photo-realistic imagery based on a Deferred Neural Renderer. This is trained end-to-end to find the best renderer and texture map for a given task.
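Here is a hedged PyTorch sketch of that pipeline: a learnable neural texture is sampled with rasterized UV coordinates via grid_sample, and a small convolutional "deferred neural renderer" maps the screen-space features to RGB. All shapes, sizes, and the renderer architecture are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feature_dim, tex_res, img_res = 16, 256, 128

# Learnable high-dimensional texture stored on the mesh's UV atlas.
neural_texture = nn.Parameter(torch.randn(1, feature_dim, tex_res, tex_res))

# Small conv net standing in for the deferred neural renderer.
renderer = nn.Sequential(
    nn.Conv2d(feature_dim, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
)

# uv: per-pixel texture coordinates in [-1, 1], as produced by rasterizing the mesh
# with the standard graphics pipeline (random here, just to exercise the code).
uv = torch.rand(1, img_res, img_res, 2) * 2 - 1
screen_features = F.grid_sample(neural_texture, uv, align_corners=True)
rgb = renderer(screen_features)   # rendered image; texture and renderer train end-to-end
print(rgb.shape)                  # torch.Size([1, 3, 128, 128])
```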

GAN approaches in this field include InterFaceGAN, UGAN, AttGAN, and STGAN.

Datasets

The only available database for research in this area is the FaceForensics database. Initially, the FaceForensics database was focused on the Face2Face approach. This is a computer graphics approach that transfers the expression of a source video to a target video while maintaining the identity of the target person. This was carried out through manual keyframe selection. Concretely, the first frames of each video were used to obtain a temporary face identity (i.e. a 3D model), and track the expression over the remaining frames. Then, fake videos were generated by transferring the source expression parameters of each frame (i.e. 76 Blendshape coefficients) to the target video.

Later on, the same authors presented in FaceForensics++ a new learning approach based on the NeuralTextures discussed above.

Rössler et al. achieved state-of-the-art expression swap detection on the FaceForensics++ dataset above.

Source: https://arxiv.org/pdf/1901.08971.pdf

Since the goal is to detect forgeries of facial imagery, they used additional domain-specific information such as mesoscopic and steganalysis features that were extracted from input sequences. This incorporation of domain knowledge improves the overall performance of a forgery detector in comparison to a naïve approach that uses the whole image as input.

Indeed, this approach outperforms almost all naïve approaches, as seen in the table below.

Source: https://arxiv.org/pdf/2001.00179v3.pdf

In addition to the four areas that we elaborated on, there are nascent areas of research in DeepFake techniques that are not as widely discussed but do have the potential for advancement. We’ll briefly explore two such areas:

  • Face Morphing
  • Audio to Video and Text to Video

Face Morphing

This is the creation of artificial biometric face samples that resemble the biometric information of two or more individuals. This means the new morphed face image could be successfully verified against facial samples of those two or more individuals, posing a serious threat to face recognition systems.

There aren’t many publicly available datasets, so there is no benchmark for a fair comparison between models. The FVC private dataset contains 1,800 photos of 150 subjects that were generated using 6 different algorithms.

Audio to Video & Text to Video

These are lip-sync deep fakes where the input can be audio or text, along with many hours of video of the subject. An LSTM is then used to learn the mapping from the raw audio or text features to mouth shapes. Based on the mouth shape at each frame, it is possible to synthesize a high-quality mouth texture, which can then be combined with 3D pose matching to create a new video that matches the input audio track.
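A minimal PyTorch sketch of the audio-to-mouth-shape mapping might look like the following; the feature dimensions and the choice of MFCC inputs and mouth-shape parameters are assumptions, not taken from a specific implementation.

```python
import torch
import torch.nn as nn

# An LSTM maps a sequence of per-frame audio features (e.g. MFCCs) to per-frame
# mouth-shape parameters (e.g. PCA coefficients of lip landmarks). Sizes are illustrative.
audio_dim, hidden_dim, mouth_dim = 28, 128, 20

class AudioToMouth(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(audio_dim, hidden_dim, num_layers=1, batch_first=True)
        self.head = nn.Linear(hidden_dim, mouth_dim)

    def forward(self, audio_features):      # (batch, frames, audio_dim)
        hidden, _ = self.lstm(audio_features)
        return self.head(hidden)            # (batch, frames, mouth_dim)

model = AudioToMouth()
audio = torch.randn(2, 100, audio_dim)      # 100 frames of audio features
mouth_shapes = model(audio)
loss = nn.functional.mse_loss(mouth_shapes, torch.randn(2, 100, mouth_dim))
print(mouth_shapes.shape, loss.item())
```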

One such implementation is publicly available here.

The implemented solution at a high level is as follows. Audio to video generation can be formulated as a conditional probability distribution matching problem. This can be solved by minimizing the distance between real video distribution and generated video distribution in an adversarial manner.

Let A ≡ {A_1, A_2, · · · , A_t, · · · , A_K} be the audio sequence, where each element A_t represents an audio clip in this sequence.

Let I ≡ {I_1, I_2, · · · , I_t, · · · , I_K} be the corresponding real talking-face sequence, and let I* ∈ I be the identity image, which can be any single frame chosen from the real sequence.

Given the audio sequence A and a single identity image I* as conditions, talking-face generation aims to produce a sequence of frames Ĩ ≡ {Ĩ_1, Ĩ_2, · · · , Ĩ_t, · · · , Ĩ_K} such that the conditional probability distribution of Ĩ given A and I* is close to that of I given A and I*, i.e., p(Ĩ | A, I*) ≈ p(I | A, I*).

One such implementation produced the following results.

Conclusion

All in all, DeepFakes are an exciting area of research, and there’s a lot of potential to create realistic content and videos using GANs. Even though the human eye can easily be fooled by current deep fakes, the good news is that, at least for now, most DeepFake detection algorithms are able to spot GAN-generated images. This state of affairs won’t last forever, however, since with the advent of techniques like GANprintR it is possible to fool some of the DeepFake detection algorithms.

DeepFake technology is a double-edged sword and has the potential to create exciting products as well as spread misinformation. Therefore, it is upon all of us to practice responsible AI understanding that with great power comes great responsibility.

I believe this meme will suffice to guide your exploration into DeepFakes.

Happy hacking :)

