
Hybrid Spatio-Temporal Network for Face Forgery Detection

  • Conference paper
Pattern Recognition (ACPR 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14408)


Abstract

Facial manipulation techniques have raised increasing security concerns, prompting a variety of methods for detecting forged videos. However, existing video methods suffer a significant performance gap relative to image manipulation detectors, partly because spatio-temporal information is not well exploited. To address this issue, we introduce a Hybrid Spatio-Temporal Network (HSTNet) that integrates spatial and temporal information in a single framework. Specifically, HSTNet adopts a hybrid architecture, consisting of a 3D CNN branch and a transformer branch, to jointly learn short- and long-range relations along the spatio-temporal dimension. Because the features of the two branches are misaligned, we design a Feature Alignment Block (FAB) to recalibrate and efficiently fuse the heterogeneous features. Moreover, HSTNet introduces a Vector Selection Block (VSB) to combine the outputs of the two branches and emphasize the features most important for classification. Extensive experiments show that HSTNet achieves the best overall performance among state-of-the-art methods.
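The abstract's two-branch design can be pictured with a minimal numpy sketch. Everything below is an assumption for illustration: the shapes, the linear projection standing in for the FAB, and the per-channel softmax gate standing in for the VSB are hypothetical simplifications, not the paper's actual blocks.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical shapes; the abstract does not specify these.
C, T, H, W = 64, 8, 14, 14    # 3D-CNN branch: channels x frames x height x width
N, D = 196, 128               # transformer branch: tokens x embedding dim

cnn_feat = rng.standard_normal((C, T, H, W))
trans_tokens = rng.standard_normal((N, D))

# Stand-in for feature alignment (FAB): flatten the CNN volume into tokens
# and linearly project the channel dim C onto the transformer dim D.
W_align = rng.standard_normal((C, D)) * 0.02
cnn_tokens = cnn_feat.reshape(C, -1).T @ W_align      # (T*H*W, D)

# Pool each branch to a single descriptor vector.
v_cnn = cnn_tokens.mean(axis=0)                       # (D,)
v_trans = trans_tokens.mean(axis=0)                   # (D,)

# Stand-in for vector selection (VSB): a per-channel softmax gate deciding,
# channel by channel, how much each branch contributes to the fused vector.
gate = softmax(np.stack([v_cnn, v_trans]), axis=0)    # (2, D), columns sum to 1
fused = gate[0] * v_cnn + gate[1] * v_trans           # (D,)

# Binary real/fake head on the fused descriptor.
W_cls = rng.standard_normal((D, 2)) * 0.02
probs = softmax(fused @ W_cls)                        # (2,) class probabilities
```

The gate makes the fusion competitive rather than additive: where one branch's response dominates a channel, the other branch's contribution to that channel is suppressed, which is one plausible reading of "selecting" important features.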



Acknowledgements

This work was supported by the "One Thousand Plan" projects in Jiangxi Province (Jxsg2023102268) and the National Key Laboratory on Automatic Target Recognition (220402).

Author information


Corresponding author

Correspondence to Baochang Zhang.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Liu, X. et al. (2023). Hybrid Spatio-Temporal Network for Face Forgery Detection. In: Lu, H., Blumenstein, M., Cho, SB., Liu, CL., Yagi, Y., Kamiya, T. (eds) Pattern Recognition. ACPR 2023. Lecture Notes in Computer Science, vol 14408. Springer, Cham. https://doi.org/10.1007/978-3-031-47665-5_21


  • DOI: https://doi.org/10.1007/978-3-031-47665-5_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-47664-8

  • Online ISBN: 978-3-031-47665-5

  • eBook Packages: Computer Science (R0)
