ABSTRACT
In this paper, we present our solution to the MuSe-Mimic sub-challenge of the 4th Multimodal Sentiment Analysis Challenge (MuSe 2023). This sub-challenge aims to predict the levels of approval, disappointment, and uncertainty in user-generated video clips. In our experiments, we found that naive joint training of multiple modalities via late fusion results in insufficient learning of unimodal features. Moreover, different modalities contribute unequally to MuSe-Mimic: relying solely on multimodal features, or treating unimodal features equally, can limit the model's generalization performance. To address these challenges, we propose an efficient multimodal transformer equipped with a modality-aware adaptive training strategy that facilitates joint training on multimodal sequence inputs. This framework leverages cross-modal interactions while ensuring adequate learning of unimodal features. Our model achieves a mean Pearson's correlation coefficient of .729 (ranking 2nd), substantially outperforming the official baseline of .473. Our code is available at https://github.com/dingchaoyue/Multimodal-Emotion-Recognition-MER-and-MuSe-2023-Challenges.
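For concreteness, the challenge metric reported above can be reproduced in a few lines. The sketch below is our own illustration (the helper names `pearson` and `mean_pearson` are not taken from the released code): it averages Pearson's correlation coefficient over the three target dimensions.

```python
import numpy as np

def pearson(preds: np.ndarray, labels: np.ndarray) -> float:
    """Pearson's correlation coefficient between two 1-D arrays."""
    p = preds - preds.mean()
    l = labels - labels.mean()
    return float((p * l).sum() / (np.sqrt((p**2).sum()) * np.sqrt((l**2).sum())))

def mean_pearson(preds: np.ndarray, labels: np.ndarray) -> float:
    """Mean Pearson correlation over target dimensions.

    preds, labels: arrays of shape (num_clips, 3) holding per-clip
    scores for approval, disappointment, and uncertainty.
    """
    return float(np.mean([pearson(preds[:, d], labels[:, d])
                          for d in range(preds.shape[1])]))
```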
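One plausible way to realize a modality-aware adaptive training strategy is to supplement the fusion loss with per-modality auxiliary losses whose weights are adapted during training, so that under-trained unimodal branches still receive useful gradients. The following is a minimal sketch under that assumption, not the authors' released implementation; `weighted_multimodal_loss` and its arguments are hypothetical.

```python
import torch

def weighted_multimodal_loss(unimodal_losses: dict,
                             fusion_loss: torch.Tensor,
                             weights: dict) -> torch.Tensor:
    """Combine the fusion loss with weighted unimodal auxiliary losses.

    unimodal_losses: per-modality losses, e.g. {"audio": ..., "video": ...}.
    weights: per-modality coefficients; in a modality-aware scheme these
    would be updated during training (e.g. from each branch's recent
    validation correlation) rather than held fixed.
    """
    total = fusion_loss
    for modality, loss in unimodal_losses.items():
        total = total + weights[modality] * loss
    return total
```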