DOI: 10.1145/3581783.3612872

Building Robust Multimodal Sentiment Recognition via a Simple yet Effective Multimodal Transformer

Published: 27 October 2023

ABSTRACT

In this paper, we present our solutions to the MER-MULTI and MER-NOISE sub-challenges of the Multimodal Emotion Recognition Challenge (MER 2023). In both sub-challenges, participants are required to recognize discrete and dimensional emotions; in MER-NOISE, the test videos are additionally corrupted with noise, so modality robustness must be taken into account. Our empirical findings indicate that the modalities contribute unequally to these tasks: the audio and visual modalities carry most of the signal, while the text modality plays a weaker role in emotion prediction. To facilitate subsequent multimodal fusion, and considering that linguistic information is implicitly embedded in large pre-trained speech models, we deliberately abandon the text modality and rely solely on the visual and acoustic modalities for these sub-challenges. To address the potential underfitting of individual modalities during multimodal training, we propose to jointly train all modalities via a weighted blending of supervision signals. Furthermore, to enhance the robustness of our model, we employ a range of data augmentation techniques at the image, waveform, and spectrogram levels. Experimental results show that our model ranks 1st in both the MER-MULTI (0.7005) and MER-NOISE (0.6846) sub-challenges, validating the effectiveness of our method. Our code is publicly available at https://github.com/dingchaoyue/Multimodal-Emotion-Recognition-MER-and-MuSe-2023-Challenges.
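The weighted blending of supervision signals described above can be illustrated with a short sketch: each modality branch and the fused representation receive their own prediction head, and the per-branch losses are combined with scalar weights so that no single modality is left underfit during joint training. The code below is a hypothetical PyTorch illustration under assumed feature dimensions, architecture choices, and loss weights (e.g. the names BlendedSupervisionModel and blended_loss, and the 0.3/0.3/1.0 weights are illustrative); it is not the authors' released implementation, which is available at the GitHub link above.

```python
# Hypothetical sketch of joint training with a weighted blend of supervision
# signals: separate heads for the audio branch, the visual branch, and the
# fused representation, with their losses combined by scalar weights.
# Dimensions, weights, and module names are illustrative assumptions.
import torch
import torch.nn as nn


class BlendedSupervisionModel(nn.Module):
    def __init__(self, audio_dim=1024, visual_dim=512, hidden_dim=256, num_classes=6):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        # Simple transformer encoder over the concatenated modality tokens
        encoder_layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # One classification head per supervision signal
        self.audio_head = nn.Linear(hidden_dim, num_classes)
        self.visual_head = nn.Linear(hidden_dim, num_classes)
        self.fused_head = nn.Linear(hidden_dim, num_classes)

    def forward(self, audio_feat, visual_feat):
        a = self.audio_proj(audio_feat)    # (B, T_audio, H)
        v = self.visual_proj(visual_feat)  # (B, T_visual, H)
        fused = self.fusion(torch.cat([a, v], dim=1)).mean(dim=1)
        return (self.audio_head(a.mean(dim=1)),
                self.visual_head(v.mean(dim=1)),
                self.fused_head(fused))


def blended_loss(logits_a, logits_v, logits_f, labels, w_a=0.3, w_v=0.3, w_f=1.0):
    """Weighted blend of unimodal and fused supervision signals (weights assumed)."""
    ce = nn.CrossEntropyLoss()
    return w_a * ce(logits_a, labels) + w_v * ce(logits_v, labels) + w_f * ce(logits_f, labels)


if __name__ == "__main__":
    model = BlendedSupervisionModel()
    audio = torch.randn(2, 50, 1024)   # e.g. frame-level features from a pre-trained speech model
    video = torch.randn(2, 16, 512)    # e.g. per-frame face embeddings
    labels = torch.randint(0, 6, (2,))
    loss = blended_loss(*model(audio, video), labels)
    loss.backward()
```

Tuning the blend weights trades off how strongly each unimodal branch is supervised relative to the fused prediction; keeping the unimodal terms non-zero is what discourages any single branch from underfitting.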



        • Published in

          MM '23: Proceedings of the 31st ACM International Conference on Multimedia
          October 2023
          9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

          Copyright © 2023 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 27 October 2023


          Qualifiers

          • research-article

          Acceptance Rates

Overall Acceptance Rate: 995 of 4,171 submissions, 24%

