Abstract
As manipulated faces become more realistic and harder to distinguish, there is a growing demand for efficient and accurate deepfake detection. Existing CNN-based deepfake detection methods either learn a global feature representation of the whole face or learn multiple local features. However, these methods learn the global and local features independently, thus neglecting the spatial correlations between local features and global context, which are vital for identifying different forgery patterns. Therefore, in this paper, we propose the Spatial Interaction Network (SI-Net), a deepfake detection method that concurrently mines potential complementary and co-occurrent features between local texture and global context. Specifically, we first utilize a region feature extractor that distills local features from the global features, simplifying the procedure of local feature extraction. We then propose a spatial-aware transformer to learn co-occurrence features from local texture and global context concurrently, and we capture attended features from the local regions according to their importance. The final prediction is made through the composite consideration of the aforementioned modules. Experimental results on two public datasets, FaceForensics++ and WildDeepfake, demonstrate the superior performance of SI-Net compared with state-of-the-art methods.
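The interaction the abstract describes can be illustrated with a minimal, dependency-light sketch. This is not the authors' implementation: the region layout (fixed quadrants), the single unlearned attention head, and all variable names (`feat`, `local`, `glob`, `cooccur`, `importance`) are assumptions made purely to show the flow of distilling local features from a global feature map, letting them attend to the global context, and aggregating them by importance.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical shapes: a CNN backbone yields an H x W x C global feature map.
rng = np.random.default_rng(0)
H = W = 8
C = 16
feat = rng.normal(size=(H, W, C))  # global feature map

# Region feature extractor: distill local features from the global map
# by pooling over regions (here, fixed quadrants for simplicity).
regions = [feat[:4, :4], feat[:4, 4:], feat[4:, :4], feat[4:, 4:]]
local = np.stack([r.mean(axis=(0, 1)) for r in regions])  # (4, C) local tokens
glob = feat.mean(axis=(0, 1))                             # (C,) global context

# Spatial-aware interaction (one unlearned attention head): each local
# token attends over all local tokens plus the global context token,
# producing a co-occurrence feature per region.
tokens = np.vstack([local, glob])                # (5, C)
attn = softmax(local @ tokens.T / np.sqrt(C))    # (4, 5) attention weights
cooccur = attn @ tokens                          # (4, C) co-occurrence features

# Importance-weighted aggregation of the attended local regions.
importance = softmax(cooccur.sum(axis=1))        # (4,) region importance
fused = importance @ cooccur                     # (C,) fused representation
print(fused.shape)                               # (16,)
```

A real model would learn the query/key/value projections and the pooling regions end to end, and feed `fused` into a classifier head; this sketch only shows how local and global features can interact in one pass rather than being computed independently.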
Acknowledgements
This work is supported by the Open Funding Project of the State Key Laboratory of Communication Content Cognition (No. 20K03) and the National Natural Science Foundation of China under Grant 62076131.
Cite this article
Wang, J., Du, X., Cheng, Y. et al. SI-Net: spatial interaction network for deepfake detection. Multimedia Systems 29, 3139–3150 (2023). https://doi.org/10.1007/s00530-023-01114-w