Abstract
Multimodal sarcasm detection aims to determine whether the semantics expressed in different modalities conflict. Existing research relies primarily on direct interaction between image and text; the semantic gap between the two modalities makes cross-modal alignment and information integration difficult, which limits detection performance. In this paper, we propose a progressive interaction approach. First, instead of the traditional direct interaction, we adopt a pre-interaction stage that bridges image and text through attributes, reducing the semantic gap between them. Next, contrastive learning is employed to align image and text features and better synchronize image-text semantics. Finally, sarcasm cues are captured through image-text interaction to detect sarcasm. In the pre-interaction stage, we design separate components for the image and the text to mediate their interaction with the attributes. Experiments demonstrate the strong performance of our method on a multimodal sarcasm detection task.
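The abstract does not specify the contrastive objective used for image-text alignment. A minimal sketch of a symmetric InfoNCE-style contrastive loss, a common choice for this step, is shown below; the function name, the cosine-similarity formulation, the temperature value, and the NumPy implementation are all illustrative assumptions, not the paper's exact objective.

```python
import numpy as np

def info_nce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    img_emb, txt_emb: (batch, dim) arrays; row i of each is a matched pair.
    This is an assumed formulation for illustration only.
    """
    # L2-normalize so dot products become cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (batch, batch) similarity matrix

    def cross_entropy(l):
        # Diagonal entries are the positive (matched) pairs
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Symmetric loss: image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Minimizing this loss pulls each matched image-text pair together while pushing apart mismatched pairs within the batch, which is the alignment effect the abstract describes.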
Funding
This work was supported by the Graduate Innovation Fund project of Anhui University of Science and Technology (Grant No. 2024cx2120), the National Natural Science Foundation of China (Grant Nos. 62476005 and 62076006), and the Opening Foundation of the State Key Laboratory of Cognitive Intelligence, iFLYTEK (Grant No. COGOS-2023HE02).
Author information
Authors and Affiliations
Contributions
Y.Z. wrote the original draft, validated the results, developed the software, designed the methodology, performed formal analysis, curated the data, and conceptualized the study; G.Z. validated the results, administered the project, and curated the data; Y.D. reviewed and edited the manuscript and conducted the investigation; Z.W. reviewed and edited the manuscript, provided resources, and curated the data; L.C. reviewed and edited the manuscript and curated the data; K.-C.L. reviewed and edited the manuscript and provided resources. All authors reviewed the manuscript.
Corresponding authors
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethical approval
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, Y., Zhu, G., Ding, Y. et al. A progressive interaction model for multimodal sarcasm detection. J Supercomput 81, 624 (2025). https://doi.org/10.1007/s11227-025-07110-3