Abstract
Multimodal sarcasm detection aims to determine whether the semantics expressed in different modalities conflict. Existing research relies primarily on direct interaction between image and text; the semantic gap between the two modalities makes cross-modal alignment and information integration difficult, which limits detection performance. In this paper, we propose a progressive interaction approach. First, instead of the traditional direct interaction, we adopt a pre-interaction stage that bridges image and text through attributes, reducing the semantic gap between them. Next, contrastive learning is employed to align image and text features and better synchronize image-text semantics. Finally, sarcasm cues are captured through image-text interaction to detect sarcasm. In the pre-interaction stage, we design separate components for the image and the text to mediate their interaction with the attributes. Experiments demonstrate the strong performance of our method on a multimodal sarcasm detection task.
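The abstract does not specify the contrastive objective used for image-text alignment. A minimal sketch of a symmetric InfoNCE-style contrastive loss, a common choice for this step, is shown below; the function name, the cosine-similarity formulation, the temperature value, and the NumPy implementation are all illustrative assumptions, not the paper's exact objective.

```python
import numpy as np

def info_nce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    img_emb, txt_emb: (batch, dim) arrays; row i of each is a matched pair.
    This is an assumed formulation for illustration only.
    """
    # L2-normalize so dot products become cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (batch, batch) similarity matrix

    def cross_entropy(l):
        # Diagonal entries are the positive (matched) pairs
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Symmetric loss: image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Minimizing this loss pulls each matched image-text pair together while pushing apart mismatched pairs within the batch, which is the alignment effect the abstract describes.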
Funding
This work was supported by the Graduate Innovation Fund project of Anhui University of Science and Technology (Grant No. 2024cx2120), the National Natural Science Foundation of China (Grant Nos. 62476005 and 62076006), and the Opening Foundation of the State Key Laboratory of Cognitive Intelligence, iFLYTEK (Grant No. COGOS-2023HE02).
Author information
Authors and Affiliations
Contributions
Y.Z. wrote the original draft, validated the results, developed the software, designed the methodology, performed formal analysis, curated the data, and conceptualized the study; G.Z. validated the results, administered the project, and curated the data; Y.D. reviewed and edited the manuscript and conducted the investigation; Z.W. reviewed and edited the manuscript, provided resources, and curated the data; L.C. reviewed and edited the manuscript and curated the data; K.-C.L. reviewed and edited the manuscript and provided resources. All authors reviewed the manuscript.
Corresponding authors
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethical approval
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, Y., Zhu, G., Ding, Y. et al. A progressive interaction model for multimodal sarcasm detection. J Supercomput 81, 624 (2025). https://doi.org/10.1007/s11227-025-07110-3