
Hybrid Spatio-Temporal Network for Face Forgery Detection

  • Conference paper
Pattern Recognition (ACPR 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14408)


Abstract

Facial manipulation techniques have raised increasing security concerns, prompting a variety of methods for detecting forged videos. However, existing video methods suffer a significant performance gap relative to image manipulation detectors, partly because spatio-temporal information is not well exploited. To address this issue, we introduce a Hybrid Spatio-Temporal Network (HSTNet) that integrates spatial and temporal information in a single framework. Specifically, HSTNet adopts a hybrid architecture, consisting of a 3D CNN branch and a transformer branch, to jointly learn short- and long-range relations along the spatio-temporal dimension. Because the features of the two branches are misaligned, we design a Feature Alignment Block (FAB) to recalibrate and efficiently fuse the heterogeneous features. Moreover, HSTNet introduces a Vector Selection Block (VSB) to combine the outputs of the two branches and emphasize the features most important for classification. Extensive experiments show that HSTNet achieves the best overall performance among state-of-the-art methods.
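The abstract's two-branch design can be pictured with a minimal numpy sketch. Everything below is an assumption for illustration: the shapes, the linear projection standing in for the FAB, and the per-channel softmax gate standing in for the VSB are hypothetical simplifications, not the paper's actual blocks.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical shapes; the abstract does not specify these.
C, T, H, W = 64, 8, 14, 14    # 3D-CNN branch: channels x frames x height x width
N, D = 196, 128               # transformer branch: tokens x embedding dim

cnn_feat = rng.standard_normal((C, T, H, W))
trans_tokens = rng.standard_normal((N, D))

# Stand-in for feature alignment (FAB): flatten the CNN volume into tokens
# and linearly project the channel dim C onto the transformer dim D.
W_align = rng.standard_normal((C, D)) * 0.02
cnn_tokens = cnn_feat.reshape(C, -1).T @ W_align      # (T*H*W, D)

# Pool each branch to a single descriptor vector.
v_cnn = cnn_tokens.mean(axis=0)                       # (D,)
v_trans = trans_tokens.mean(axis=0)                   # (D,)

# Stand-in for vector selection (VSB): a per-channel softmax gate deciding,
# channel by channel, how much each branch contributes to the fused vector.
gate = softmax(np.stack([v_cnn, v_trans]), axis=0)    # (2, D), columns sum to 1
fused = gate[0] * v_cnn + gate[1] * v_trans           # (D,)

# Binary real/fake head on the fused descriptor.
W_cls = rng.standard_normal((D, 2)) * 0.02
probs = softmax(fused @ W_cls)                        # (2,) class probabilities
```

The gate makes the fusion competitive rather than additive: where one branch's response dominates a channel, the other branch's contribution to that channel is suppressed, which is one plausible reading of "selecting" important features.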



Acknowledgements

This work was supported by the "One Thousand Plan" projects in Jiangxi Province (Jxsg2023102268) and the National Key Laboratory on Automatic Target Recognition (220402).

Author information


Corresponding author

Correspondence to Baochang Zhang.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Liu, X. et al. (2023). Hybrid Spatio-Temporal Network for Face Forgery Detection. In: Lu, H., Blumenstein, M., Cho, SB., Liu, CL., Yagi, Y., Kamiya, T. (eds) Pattern Recognition. ACPR 2023. Lecture Notes in Computer Science, vol 14408. Springer, Cham. https://doi.org/10.1007/978-3-031-47665-5_21


  • DOI: https://doi.org/10.1007/978-3-031-47665-5_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-47664-8

  • Online ISBN: 978-3-031-47665-5

  • eBook Packages: Computer Science (R0)
