DOI: 10.1145/3581783.3612528

Your tone speaks louder than your face! Modality Order Infused Multi-modal Sarcasm Detection

Published: 27 October 2023

ABSTRACT

Figurative language is an essential component of human communication, and detecting sarcasm in text has become a challenging yet highly popular task in natural language processing. As humans, we rely on a combination of visual and auditory cues, such as facial expressions and tone of voice, to comprehend a message. Our brains are implicitly trained to integrate information from multiple senses to form a complete understanding of the message being conveyed, a process known as multi-sensory integration. Combining different modalities not only provides additional information but also amplifies the information conveyed by each modality in relation to the others. Thus, the order in which modalities are infused plays a significant role in multimodal processing. In this paper, we investigate the impact of different modality infusion orders on identifying sarcasm in dialogues. We propose MO-Sarcation, a modality order-driven module integrated into a transformer network that fuses modalities in an ordered manner. Our model outperforms several state-of-the-art models by 1-3% across various metrics, demonstrating the crucial role of modality order in sarcasm detection. The improvements and detailed analysis show that audio tone should be infused with textual content first, followed by visual information, to identify sarcasm efficiently. The code and dataset are available at https://github.com/mohit2b/MO-Sarcation.
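To make the ordered-fusion idea concrete, below is a minimal PyTorch sketch of cross-attention fusion in the text <- audio <- video order that the abstract recommends. It is an illustration under our own assumptions (feature dimensions, pooling, and module names such as CrossModalBlock and OrderedFusionClassifier are hypothetical), not the authors' released MO-Sarcation implementation; see the linked repository for the actual code.

```python
# Sketch only (assumed architecture, not the authors' released MO-Sarcation code):
# audio tone is infused into the text representation first, visual features last.
import torch
import torch.nn as nn


class CrossModalBlock(nn.Module):
    """Infuse a context modality into a target sequence via cross-attention."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, target, context):
        attended, _ = self.attn(query=target, key=context, value=context)
        target = self.norm1(target + attended)        # residual + layer norm
        return self.norm2(target + self.ff(target))   # position-wise feed-forward


class OrderedFusionClassifier(nn.Module):
    """Fusion order text <- audio <- video, followed by a sarcasm head."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.text_audio = CrossModalBlock(dim)  # step 1: infuse audio tone into text
        self.text_video = CrossModalBlock(dim)  # step 2: infuse visual cues last
        self.head = nn.Linear(dim, 2)           # sarcastic vs. non-sarcastic

    def forward(self, text_feats, audio_feats, video_feats):
        # text_feats: (B, T_text, dim); audio_feats: (B, T_audio, dim); video_feats: (B, T_video, dim)
        h = self.text_audio(text_feats, audio_feats)
        h = self.text_video(h, video_feats)
        return self.head(h.mean(dim=1))         # mean-pool the fused sequence


if __name__ == "__main__":
    model = OrderedFusionClassifier()
    t, a, v = torch.randn(2, 20, 768), torch.randn(2, 50, 768), torch.randn(2, 30, 768)
    print(model(t, a, v).shape)  # torch.Size([2, 2])
```

Swapping the order of the two CrossModalBlock calls (for example, fusing video before audio) is the kind of variation the paper compares; its reported finding is that infusing audio with text first and visual information last works best.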


Supplemental Material

mmfp3770-video.mp4 (MP4, 105.4 MB)

Published in

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

Copyright © 2023 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Qualifiers

research-article

Acceptance Rates

Overall Acceptance Rate: 995 of 4,171 submissions, 24%
