SI-Net: spatial interaction network for deepfake detection

  • Special Issue Paper
  • Published in: Multimedia Systems

Abstract

As manipulated faces become increasingly realistic and hard to distinguish from genuine ones, there is a pressing need for efficient and accurate deepfake detection. Existing CNN-based deepfake detection methods either learn a global feature representation of the whole face or learn multiple local features. However, these methods learn the global and local features independently, and thus neglect the spatial correlations between local features and global context, which are vital for identifying different forgery patterns. In this paper, we therefore propose the Spatial Interaction Network (SI-Net), a deepfake detection method that jointly mines complementary and co-occurring features between local texture and global context. Specifically, we first employ a region feature extractor that distills local features from the global feature map, simplifying local feature extraction. We then propose a spatial-aware transformer that learns co-occurrence features from local texture and global context concurrently. Finally, we aggregate the local regions into an attended feature according to their importance, and the final prediction combines the outputs of these modules. Experimental results on two public datasets, FaceForensics++ and WildDeepfake, demonstrate the superior performance of SI-Net compared with state-of-the-art methods.
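
To make the described pipeline concrete, the following is a minimal PyTorch sketch of an architecture matching the abstract: a CNN backbone produces a global feature map, a region feature extractor distills local region tokens from it, a transformer lets the local and global tokens interact, and the local tokens are aggregated by learned importance before classification. Every concrete detail here (the toy backbone, mask-based region pooling, a standard transformer encoder standing in for the spatial-aware transformer, and all dimensions) is an illustrative assumption, not the authors' implementation.

    import torch
    import torch.nn as nn

    class RegionFeatureExtractor(nn.Module):
        """Distills local region tokens from the global feature map by
        pooling with learned soft spatial masks (an assumed design)."""
        def __init__(self, channels: int, num_regions: int):
            super().__init__()
            # A 1x1 conv predicts one soft spatial mask per region.
            self.region_masks = nn.Conv2d(channels, num_regions, kernel_size=1)

        def forward(self, global_feat: torch.Tensor) -> torch.Tensor:
            # global_feat: (B, C, H, W)
            masks = self.region_masks(global_feat).flatten(2).softmax(-1)  # (B, R, HW)
            feats = global_feat.flatten(2)                                 # (B, C, HW)
            # Each region token is a mask-weighted sum over spatial positions.
            return torch.einsum('brn,bcn->brc', masks, feats)              # (B, R, C)

    class SINetSketch(nn.Module):
        def __init__(self, channels: int = 256, num_regions: int = 8, heads: int = 4):
            super().__init__()
            # Any CNN backbone could stand in here; a tiny one keeps the sketch runnable.
            self.backbone = nn.Sequential(
                nn.Conv2d(3, channels, 7, stride=4, padding=3), nn.ReLU(),
                nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(),
            )
            self.regions = RegionFeatureExtractor(channels, num_regions)
            # Stand-in for the spatial-aware transformer: local region tokens and
            # a global token attend to each other, so texture and context interact.
            layer = nn.TransformerEncoderLayer(d_model=channels, nhead=heads,
                                               batch_first=True)
            self.interaction = nn.TransformerEncoder(layer, num_layers=2)
            # Learned importance weights over regions for the attended feature.
            self.importance = nn.Linear(channels, 1)
            self.classifier = nn.Linear(2 * channels, 2)  # real vs. fake

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            fmap = self.backbone(x)                                  # (B, C, H, W)
            global_tok = fmap.mean(dim=(2, 3)).unsqueeze(1)          # (B, 1, C)
            local_toks = self.regions(fmap)                          # (B, R, C)
            tokens = self.interaction(torch.cat([global_tok, local_toks], dim=1))
            g, locs = tokens[:, 0], tokens[:, 1:]
            # Aggregate local tokens into one attended feature by importance.
            w = self.importance(locs).softmax(dim=1)                 # (B, R, 1)
            attended = (w * locs).sum(dim=1)                         # (B, C)
            # The prediction combines global context and attended local texture.
            return self.classifier(torch.cat([g, attended], dim=-1))

    if __name__ == "__main__":
        logits = SINetSketch()(torch.randn(2, 3, 224, 224))
        print(logits.shape)  # torch.Size([2, 2])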

Acknowledgements

This work is supported by the Open Funding Project of the State Key Laboratory of Communication Content Cognition (No. 20K03) and by the National Natural Science Foundation of China under Grant 62076131.

Author information

Corresponding author

Correspondence to Yu Cheng.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, J., Du, X., Cheng, Y. et al. SI-Net: spatial interaction network for deepfake detection. Multimedia Systems 29, 3139–3150 (2023). https://doi.org/10.1007/s00530-023-01114-w
