Abstract
The realistic nature of multi-modal deepfake content has recently attracted considerable attention from researchers, who have employed a wide range of handcrafted features, learned features, and deep learning techniques to achieve promising performance in recognizing facial deepfakes. However, attackers continue to produce deepfakes that surpass earlier ones by manipulating several modalities at once, making deepfake identification across multiple modalities difficult. To exploit the merits of attention-based network architectures, we propose a novel cross-modal attention architecture built on a bi-directional recurrent convolutional network to capture fake content in audio and video. For effective deepfake detection, the system captures the spatial–temporal deformations of audio–video sequences and investigates the correlation between these modalities. We propose a VGG16 deep model with self-attention for extracting visual features for facial fake recognition. In addition, the system incorporates a recurrent neural network with self-attention to extract forged audio elements effectively. The cross-modal attention mechanism learns the divergence between the two modalities. Furthermore, we include multi-modal fake examples to create a well-balanced bespoke dataset that addresses the drawbacks of small and imbalanced training sets. We evaluate the effectiveness of the proposed multi-modal deepfake detection strategy against state-of-the-art methods on a variety of existing datasets.
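As a minimal sketch of the fusion idea described above (not the authors' implementation), the snippet below shows how a cross-modal attention block might let per-frame visual features attend over audio features and vice versa before the fused representation is scored as real or fake. The feature dimension, number of heads, sequence lengths, and the `CrossModalAttention` module itself are illustrative assumptions.

```python
# Minimal sketch of cross-modal attention between visual and audio feature
# sequences, assuming both streams have already been encoded into sequences
# of d-dimensional vectors (e.g., by a visual CNN branch and an audio RNN branch).
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Let each modality attend over the other, then fuse for a real/fake score."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Video queries attend over audio keys/values, and vice versa.
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, video_feats: torch.Tensor, audio_feats: torch.Tensor) -> torch.Tensor:
        # video_feats: (batch, T_v, dim), audio_feats: (batch, T_a, dim)
        v_att, _ = self.v2a(video_feats, audio_feats, audio_feats)  # video attends to audio
        a_att, _ = self.a2v(audio_feats, video_feats, video_feats)  # audio attends to video
        # Pool each attended sequence over time and produce a single logit.
        fused = torch.cat([v_att.mean(dim=1), a_att.mean(dim=1)], dim=-1)
        return self.classifier(fused)  # (batch, 1); higher = more likely fake

# Toy usage with hypothetical shapes: 30 video frames and 50 audio frames per clip.
model = CrossModalAttention(dim=256, heads=4)
logit = model(torch.randn(2, 30, 256), torch.randn(2, 50, 256))
print(logit.shape)  # torch.Size([2, 1])
```

In the actual system, the video stream would come from the self-attention VGG16 branch and the audio stream from the bi-directional recurrent branch; here random tensors merely stand in for those encodings.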
Data availability
The data that support the findings of this study are openly available: DFDC at https://paperswithcode.com/dataset/dfdc, MMDFD at https://dl.acm.org/doi/10.1145/3607947.3608013, and FakeAVCeleb at https://paperswithcode.com/dataset/fakeavceleb.
Author information
Contributions
AS: Conceived and designed the analysis: contributed to the conceptualization of the research, including the formulation of the research questions and objectives. Collected the data: contributed to designing the research methodology, including data collection. Software development: developed and implemented the software used in the experiments. Performed the analysis: performed analysis using provided data and model. Writing, review and editing: responsible for the initial drafting of the manuscript, reviewed and edited the manuscript for clarity and coherence. PV: Conceived and designed the analysis: contributed to the conceptualization of the research, including the formulation of the research questions and objectives. Collected the data: involved in collecting and organizing the research data. Data analysis: Conducted data analysis and contributed to the interpretation of results. Experimental design: Contributed to the design of the experimental setup. Review and editing: reviewed and edited the manuscript for clarity and coherence. VGM: Data analysis: conducted data analysis and contributed to the interpretation of results. Experimental design: contributed to the design of the experimental setup. Writing—review and editing: contributed to manuscript review and editing. Supervision: provided overall supervision and guidance throughout the research project.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Communicated by I. Ide.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Asha, S., Vinod, P. & Menon, V.G. A defensive attention mechanism to detect deepfake content across multiple modalities. Multimedia Systems 30, 56 (2024). https://doi.org/10.1007/s00530-023-01248-x