DOI: 10.1145/3581783.3612872

Building Robust Multimodal Sentiment Recognition via a Simple yet Effective Multimodal Transformer

Published: 27 October 2023

ABSTRACT

In this paper, we present our solutions to the MER-MULTI and MER-NOISE sub-challenges of the Multimodal Emotion Recognition Challenge (MER 2023). In both sub-challenges, participants are required to recognize discrete and dimensional emotions; in MER-NOISE, the test videos are additionally corrupted with noise, so modality robustness must be taken into account. Our empirical findings indicate that the modalities contribute unequally to these tasks: the audio and visual modalities carry most of the signal, while the text modality plays a weaker role in emotion prediction. To facilitate subsequent multimodal fusion, and considering that linguistic information is implicitly embedded in large pre-trained speech models, we deliberately abandon the text modality and rely solely on the visual and acoustic modalities for these sub-challenges. To address the potential underfitting of individual modalities during multimodal training, we propose to jointly train all modalities via a weighted blending of supervision signals. Furthermore, to enhance the robustness of our model, we employ a range of data augmentation techniques at the image, waveform, and spectrogram levels. Experimental results show that our model ranks 1st in both the MER-MULTI (0.7005) and MER-NOISE (0.6846) sub-challenges, validating the effectiveness of our method. Our code is publicly available at https://github.com/dingchaoyue/Multimodal-Emotion-Recognition-MER-and-MuSe-2023-Challenges.
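The weighted blending of supervision signals described above can be illustrated with a short sketch: each modality branch and the fused representation receive their own prediction head, and the per-branch losses are combined with scalar weights so that no single modality is left underfit during joint training. The code below is a hypothetical PyTorch illustration under assumed feature dimensions, architecture choices, and loss weights (e.g. the names BlendedSupervisionModel and blended_loss, and the 0.3/0.3/1.0 weights are illustrative); it is not the authors' released implementation, which is available at the GitHub link above.

```python
# Hypothetical sketch of joint training with a weighted blend of supervision
# signals: separate heads for the audio branch, the visual branch, and the
# fused representation, with their losses combined by scalar weights.
# Dimensions, weights, and module names are illustrative assumptions.
import torch
import torch.nn as nn


class BlendedSupervisionModel(nn.Module):
    def __init__(self, audio_dim=1024, visual_dim=512, hidden_dim=256, num_classes=6):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        # Simple transformer encoder over the concatenated modality tokens
        encoder_layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # One classification head per supervision signal
        self.audio_head = nn.Linear(hidden_dim, num_classes)
        self.visual_head = nn.Linear(hidden_dim, num_classes)
        self.fused_head = nn.Linear(hidden_dim, num_classes)

    def forward(self, audio_feat, visual_feat):
        a = self.audio_proj(audio_feat)    # (B, T_audio, H)
        v = self.visual_proj(visual_feat)  # (B, T_visual, H)
        fused = self.fusion(torch.cat([a, v], dim=1)).mean(dim=1)
        return (self.audio_head(a.mean(dim=1)),
                self.visual_head(v.mean(dim=1)),
                self.fused_head(fused))


def blended_loss(logits_a, logits_v, logits_f, labels, w_a=0.3, w_v=0.3, w_f=1.0):
    """Weighted blend of unimodal and fused supervision signals (weights assumed)."""
    ce = nn.CrossEntropyLoss()
    return w_a * ce(logits_a, labels) + w_v * ce(logits_v, labels) + w_f * ce(logits_f, labels)


if __name__ == "__main__":
    model = BlendedSupervisionModel()
    audio = torch.randn(2, 50, 1024)   # e.g. frame-level features from a pre-trained speech model
    video = torch.randn(2, 16, 512)    # e.g. per-frame face embeddings
    labels = torch.randint(0, 6, (2,))
    loss = blended_loss(*model(audio, video), labels)
    loss.backward()
```

Tuning the blend weights trades off how strongly each unimodal branch is supervised relative to the fused prediction; keeping the unimodal terms non-zero is what discourages any single branch from underfitting.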



        • Published in

          MM '23: Proceedings of the 31st ACM International Conference on Multimedia
          October 2023
          9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

          Copyright © 2023 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 27 October 2023


          Qualifiers

          • research-article

          Acceptance Rates

Overall Acceptance Rate: 995 of 4,171 submissions, 24%

