DeepFake detection algorithm based on improved vision transformer

Applied Intelligence

Abstract

A DeepFake is a manipulated video made with generative deep learning technologies, such as generative adversarial networks or autoencoders, that anyone can use. As DeepFakes have proliferated, classifiers built on convolutional neural networks (CNNs) have been actively developed to detect them. However, CNNs are prone to overfitting and cannot treat the relations between local regions as a global feature of the image, which leads to misclassification. In this paper, we propose an efficient vision transformer model for DeepFake detection that extracts both local and global features. We combine vector-concatenated CNN features with patch-based positioning so that all positions interact, which helps specify the artifact region. For the distillation token, the logit is trained using binary cross entropy through the sigmoid function. Adding this distillation generalizes the proposed model and improves its performance. In experiments, the proposed model outperforms the SOTA model by 0.006 AUC and 0.013 F1 score on the DFDC test dataset. Of 2,500 fake videos, the proposed model correctly predicts 2,313 as fake, whereas the SOTA model predicts 2,276 at its best. With the ensemble method, the proposed model outperforms the SOTA model by 0.01 AUC. On the Celeb-DF (v2) dataset, the proposed model achieves 0.993 AUC and a 0.978 F1 score.
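
To make two points of the abstract concrete, the following is a minimal sketch, not the authors' implementation. It shows binary cross entropy computed through the sigmoid function on a single logit, and the detection rate implied by the reported counts of correctly flagged fake videos. The names `bce_with_logit`, `distilled_loss`, and `recall`, the weighting factor `alpha`, and the use of a teacher's sigmoid output as the target for the distillation token are all assumptions made for illustration; the abstract states only that the distillation-token logit is trained with binary cross entropy through the sigmoid.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce_with_logit(logit: float, target: float) -> float:
    """Binary cross entropy of sigmoid(logit) against a target in [0, 1].

    Uses the stable identity max(z, 0) - z * t + log(1 + exp(-|z|)),
    which equals -[t * log(s) + (1 - t) * log(1 - s)] with s = sigmoid(z).
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def distilled_loss(cls_logit: float, dist_logit: float, label: float,
                   teacher_logit: float, alpha: float = 0.5) -> float:
    """Hypothetical two-token loss (an assumption, not taken from the paper).

    The class-token logit is compared with the ground-truth label and the
    distillation-token logit with the teacher's sigmoid output.
    """
    teacher_prob = sigmoid(teacher_logit)
    class_term = bce_with_logit(cls_logit, label)
    distill_term = bce_with_logit(dist_logit, teacher_prob)
    return (1.0 - alpha) * class_term + alpha * distill_term


def recall(true_positives: int, positives: int) -> float:
    """Fraction of fake videos that were correctly predicted as fake."""
    return true_positives / positives


if __name__ == "__main__":
    # Counts reported in the abstract for 2,500 fake videos.
    print(f"proposed: {recall(2313, 2500):.4f}")  # proposed: 0.9252
    print(f"SOTA:     {recall(2276, 2500):.4f}")  # SOTA:     0.9104
    # Illustrative logits only; these values are not from the paper.
    print(distilled_loss(cls_logit=2.0, dist_logit=1.5, label=1.0, teacher_logit=3.0))
```

On the reported counts, the proposed model flags 37 more fake videos than the SOTA model, which corresponds to a detection rate of 0.9252 versus 0.9104 on this subset.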

Notes

  1. https://www.kaggle.com/c/deepfake-detection-challenge

Author information

Corresponding author

Correspondence to Byung-Gyu Kim.

About this article

Cite this article

Heo, YJ., Yeo, WH. & Kim, BG. DeepFake detection algorithm based on improved vision transformer. Appl Intell 53, 7512–7527 (2023). https://doi.org/10.1007/s10489-022-03867-9

  • DOI: https://doi.org/10.1007/s10489-022-03867-9
