
A progressive interaction model for multimodal sarcasm detection

  • Published:
The Journal of Supercomputing

Abstract

Multimodal sarcasm detection aims to determine whether the semantics expressed in different modalities conflict. Existing research relies primarily on direct interaction between image and text, which limits performance: the semantic gap between the two modalities makes cross-modal alignment and information integration difficult. In this paper, we propose a progressive interaction approach. First, instead of the traditional direct interaction, we adopt a pre-interaction step that bridges image and text through attributes, reducing the semantic gap between them. Then, contrastive learning is employed to align image and text features and better synchronize image-text semantics. Finally, sarcasm cues are captured through the interaction between image and text to detect sarcasm. In the pre-interaction phase, we design separate components for the image and text modalities to interact with the attributes. Experiments demonstrate the excellent performance of our method on a multimodal sarcasm detection task.
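The two interaction stages described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the softmax re-weighting used for the attribute bridge, and the symmetric InfoNCE form of the contrastive loss are our assumptions, chosen only to make the idea of "pre-interact through attributes, then align image and text" concrete.

```python
import numpy as np

def l2norm(x, axis=-1):
    """Normalize vectors to unit length (small epsilon avoids division by zero)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def attribute_bridge(feat, attr_feats):
    """Hypothetical pre-interaction: re-weight shared attribute embeddings by
    their similarity to a modality's features, then fuse them residually.
    feat: (B, D) image or text features; attr_feats: (K, D) attribute embeddings."""
    sims = l2norm(feat) @ l2norm(attr_feats).T            # (B, K) cosine similarities
    weights = np.exp(sims) / np.exp(sims).sum(-1, keepdims=True)  # softmax over attributes
    return feat + weights @ attr_feats                    # attribute-enriched features

def info_nce(img, txt, tau=0.07):
    """Symmetric InfoNCE loss for image-text alignment: matched pairs sit on
    the diagonal of the similarity matrix and are pulled together."""
    img, txt = l2norm(img), l2norm(txt)
    logits = img @ txt.T / tau                            # (B, B) scaled similarities
    labels = np.arange(len(img))
    def ce(lg):
        lg = lg - lg.max(-1, keepdims=True)               # stabilize log-softmax
        logp = lg - np.log(np.exp(lg).sum(-1, keepdims=True))
        return -logp[labels, labels].mean()               # diagonal = matched pairs
    return 0.5 * (ce(logits) + ce(logits.T))              # image-to-text + text-to-image
```

In a full model both modalities would pass through `attribute_bridge` before `info_nce` aligns them, so the contrastive objective operates on features whose semantic gap has already been narrowed.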


Figs. 1-7 appear in the full-text article.



Funding

This work was supported by the Graduate Innovation Fund project of Anhui University of Science and Technology (Grant No. 2024cx2120), the National Natural Science Foundation of China (Grant Nos. 62476005 and 62076006), and the Opening Foundation of the State Key Laboratory of Cognitive Intelligence, iFLYTEK (Grant No. COGOS-2023HE02).

Author information

Authors and Affiliations

Authors

Contributions

Y.Z wrote the original draft, validated the results, developed the software, designed the methodology, performed formal analysis, curated the data, and conceptualized the study; G.Z validated the results, administered the project, and curated the data; Y.D reviewed and edited the manuscript and conducted the investigation; Z.W reviewed and edited the manuscript, provided resources, and curated the data; L.C reviewed and edited the manuscript and curated the data; K.-C.L reviewed and edited the manuscript and provided resources. All authors reviewed the manuscript.

Corresponding authors

Correspondence to Guangli Zhu or Kuan-Ching Li.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical approval

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Zhang, Y., Zhu, G., Ding, Y. et al. A progressive interaction model for multimodal sarcasm detection. J Supercomput 81, 624 (2025). https://doi.org/10.1007/s11227-025-07110-3
