DOI: 10.1145/3606039.3613113
research-article

Multimodal Sentiment Analysis via Efficient Multimodal Transformer and Modality-Aware Adaptive Training Strategy

Published: 29 October 2023

ABSTRACT

In this paper, we present our solution to the MuSe-Mimic sub-challenge of the 4th Multimodal Sentiment Analysis Challenge. This sub-challenge aims to predict the levels of approval, disappointment, and uncertainty in user-generated video clips. In our experiments, we found that naive joint training of multiple modalities via late fusion results in insufficient learning of unimodal features. Moreover, different modalities contribute differently to MuSe-Mimic, so relying solely on multimodal features or treating unimodal features equally may limit the model's generalization performance. To address these challenges, we propose an efficient multimodal transformer equipped with a modality-aware adaptive training strategy to facilitate effective joint training on multimodal sequence inputs. This framework holds promise for leveraging cross-modal interactions while ensuring adequate learning of unimodal features. Our model achieves a mean Pearson's correlation coefficient of .729 (ranking 2nd), outperforming the official baseline of .473. Our code is available at https://github.com/dingchaoyue/Multimodal-Emotion-Recognition-MER-and-MuSe-2023-Challenges.
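To illustrate the kind of setup the abstract describes, below is a minimal PyTorch sketch of late-fusion joint training with per-modality auxiliary heads and modality-aware loss weights. It is an assumption-based illustration, not the authors' released implementation (see the repository linked above); all module names, feature dimensions, and the weighting scheme are hypothetical.

# Illustrative sketch only: late fusion with per-modality auxiliary heads and
# modality-aware loss weights. Dimensions, names, and the weighting scheme are
# assumptions, not the authors' actual architecture or hyperparameters.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LateFusionRegressor(nn.Module):
    """Encode each modality separately, fuse by concatenation, and keep
    unimodal heads so every encoder also receives direct supervision."""
    def __init__(self, dims, hidden=128, n_targets=3):
        super().__init__()
        self.encoders = nn.ModuleDict(
            {m: nn.Sequential(nn.Linear(d, hidden), nn.ReLU()) for m, d in dims.items()}
        )
        self.uni_heads = nn.ModuleDict({m: nn.Linear(hidden, n_targets) for m in dims})
        self.fusion_head = nn.Linear(hidden * len(dims), n_targets)

    def forward(self, inputs):
        feats = {m: enc(inputs[m]) for m, enc in self.encoders.items()}
        uni_preds = {m: self.uni_heads[m](h) for m, h in feats.items()}
        fused = self.fusion_head(torch.cat([feats[m] for m in self.encoders], dim=-1))
        return fused, uni_preds

def modality_aware_loss(fused, uni_preds, target, weights):
    """Multimodal loss plus weighted unimodal losses; weaker or under-trained
    modalities can be assigned larger weights so their encoders keep learning."""
    loss = F.mse_loss(fused, target)
    for m, pred in uni_preds.items():
        loss = loss + weights.get(m, 1.0) * F.mse_loss(pred, target)
    return loss

# Toy usage with hypothetical feature dimensions for audio/video/text embeddings.
dims = {"audio": 1024, "video": 768, "text": 768}
model = LateFusionRegressor(dims)
batch = {m: torch.randn(4, d) for m, d in dims.items()}
fused, uni = model(batch)
loss = modality_aware_loss(fused, uni, torch.randn(4, 3),
                           {"audio": 0.5, "video": 1.0, "text": 1.5})
loss.backward()

In such a scheme, the per-modality weights could be adapted during training (for example, from each modality's validation Pearson correlation) so that weaker modalities receive stronger unimodal supervision; the specific adaptive strategy used by the authors is described in the paper itself.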


Published in

MuSe '23: Proceedings of the 4th on Multimodal Sentiment Analysis Challenge and Workshop: Mimicked Emotions, Humour and Personalisation
November 2023, 113 pages
ISBN: 9798400702709
DOI: 10.1145/3606039

Copyright © 2023 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher: Association for Computing Machinery, New York, NY, United States

Published: 29 October 2023

Qualifiers

• research-article

Acceptance Rates

Overall acceptance rate: 14 of 17 submissions, 82%
