DOI: 10.1145/3606039.3613112

JTMA: Joint Multimodal Feature Fusion and Temporal Multi-head Attention for Humor Detection

Published: 29 October 2023

Abstract

In this paper, we propose a model named Joint multimodal feature fusion and Temporal Multi-head Attention (JTMA) to address the MuSe-Humor sub-challenge of the Multimodal Sentiment Analysis Challenge 2023. The goal of the MuSe-Humor sub-challenge is to predict whether humor occurs in a given dataset that contains multiple modalities (e.g., text, audio, and video). Cross-cultural testing introduces a new difficulty that distinguishes this year's task from previous editions. To address these problems, the proposed JTMA model first uses a 1-D CNN to aggregate temporal information within each unimodal feature sequence. Inter-modality and intra-modality interactions are then modeled by a multimodal feature encoder module. Finally, we integrate the high-level representations learned from the modalities to predict humor. The effectiveness of the proposed model is demonstrated by experimental results on the official test set: JTMA achieves an AUC of 0.8889, surpassing all other participants in the competition and securing the top-1 ranking.
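As a rough illustration of the pipeline described above, the sketch below shows a JTMA-style model in PyTorch: a 1-D CNN aggregates temporal information per modality, multi-head attention models intra- and inter-modality interactions, and a small classifier produces per-step humor probabilities. This is a minimal sketch based only on this abstract; the feature dimensions, layer sizes, class names, and the simple average-plus-cross-attention fusion are illustrative assumptions, not the authors' published configuration.

# Minimal JTMA-style sketch (dimensions, layer sizes, and fusion scheme are assumptions).
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """1-D CNN over time followed by multi-head self-attention (intra-modality)."""
    def __init__(self, in_dim, hidden_dim=128, n_heads=4):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, hidden_dim, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(hidden_dim, n_heads, batch_first=True)

    def forward(self, x):                                   # x: (batch, time, in_dim)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)    # temporal aggregation
        h, _ = self.attn(h, h, h)                           # intra-modality interactions
        return h                                            # (batch, time, hidden_dim)

class JTMASketch(nn.Module):
    """Encode text/audio/video features, fuse them, and predict humor per time step."""
    def __init__(self, dims=(768, 1024, 512), hidden_dim=128, n_heads=4):
        super().__init__()
        self.encoders = nn.ModuleList(
            [ModalityEncoder(d, hidden_dim, n_heads) for d in dims])
        self.cross_attn = nn.MultiheadAttention(hidden_dim, n_heads, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, feats):                               # list of (batch, time, dim_m)
        encoded = [enc(f) for enc, f in zip(self.encoders, feats)]
        query = torch.stack(encoded).mean(dim=0)            # placeholder fusion of modalities
        memory = torch.cat(encoded, dim=1)                  # all modalities along the time axis
        fused, _ = self.cross_attn(query, memory, memory)   # inter-modality interactions
        return torch.sigmoid(self.classifier(fused)).squeeze(-1)  # humor probability per step

# Usage with random stand-in features for text, audio, and video.
model = JTMASketch()
feats = [torch.randn(2, 20, d) for d in (768, 1024, 512)]
print(model(feats).shape)                                   # torch.Size([2, 20])

In a real setup the inputs would be pre-extracted, temporally aligned features for each modality, and the per-step predictions would be evaluated with AUC as in the challenge protocol.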



      Published In

      MuSe '23: Proceedings of the 4th on Multimodal Sentiment Analysis Challenge and Workshop: Mimicked Emotions, Humour and Personalisation
      November 2023
      113 pages
      ISBN: 9798400702709
      DOI: 10.1145/3606039

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. humor detection
      2. multimodal fusion
      3. multimodal sentiment analysis
      4. temporal multi-head attention

      Qualifiers

      • Research-article

      Funding Sources

      • National Key R&D Programme of China
      • Major Project of Anhui Province

      Conference

      MM '23

      Acceptance Rates

      Overall Acceptance Rate 14 of 17 submissions, 82%


      Cited By

      • AMTN: Attention-Enhanced Multimodal Temporal Network for Humor Detection. In Proceedings of the 5th on Multimodal Sentiment Analysis Challenge and Workshop: Social Perception and Humor (2024), 65-69. DOI: 10.1145/3689062.3689375
      • The MuSe 2024 Multimodal Sentiment Analysis Challenge: Social Perception and Humor Recognition. In Proceedings of the 5th on Multimodal Sentiment Analysis Challenge and Workshop: Social Perception and Humor (2024), 1-9. DOI: 10.1145/3689062.3689088
      • Social Perception Prediction for MuSe 2024: Joint Learning of Multiple Perceptions. In Proceedings of the 5th on Multimodal Sentiment Analysis Challenge and Workshop: Social Perception and Humor (2024), 52-59. DOI: 10.1145/3689062.3689087
      • DPP: A Dual-Phase Processing Method for Cross-Cultural Humor Detection. In Proceedings of the 5th on Multimodal Sentiment Analysis Challenge and Workshop: Social Perception and Humor (2024), 70-78. DOI: 10.1145/3689062.3689080
      • MuSe 2023 Challenge: Multimodal Prediction of Mimicked Emotions, Cross-Cultural Humour, and Personalised Recognition of Affects. In Proceedings of the 31st ACM International Conference on Multimedia (2023), 9723-9725. DOI: 10.1145/3581783.3610943
