ABSTRACT
Multimodal emotion recognition has attracted increasing attention in recent decades. Although remarkable progress has been achieved with the rapid development of deep learning, existing methods still struggle with the noise that commonly occurs in practical applications of emotion recognition. To improve the robustness of multimodal emotion recognition, we propose an MLP-based label revision algorithm. The framework consists of three complementary feature extraction networks that were verified in MER2023. An MLP-based attention network with specially designed loss functions is then used to fuse the features from the different modalities. Finally, the output probability of each emotion is used to revise the category that the classifier assigns to each test sample, so that the samples most likely to be affected by noise and misclassified have a chance to be classified correctly. Our best result on the test set of the MER 2023 NOISE subchallenge is an F1-score of 86.35 with a combined metric of 0.6694, which ranks 2nd in the subchallenge.
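The label revision step can be sketched in code. The following is a hypothetical illustration, not the authors' exact algorithm: it assumes per-class output probabilities from the fused classifier and, as one plausible revision rule, reassigns a sample to the runner-up class when the gap between the top two probabilities falls below a margin, i.e. when the prediction is most likely to have been flipped by noise. The function name `revise_labels` and the `margin` parameter are illustrative assumptions.

```python
def revise_labels(prob_rows, margin=0.1):
    """Revise predicted labels for low-confidence samples.

    prob_rows: list of per-class probability lists, one per sample.
    Samples whose top-1 and top-2 probabilities differ by less than
    `margin` are treated as noise-affected and reassigned to the
    runner-up class (an assumed revision rule for illustration).
    """
    revised = []
    for probs in prob_rows:
        # Rank class indices by probability, highest first.
        ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
        top, runner_up = ranked[0], ranked[1]
        # Confident prediction: keep it. Ambiguous one: revise.
        if probs[top] - probs[runner_up] < margin:
            revised.append(runner_up)
        else:
            revised.append(top)
    return revised
```

Under this sketch, a sample with probabilities (0.45, 0.40, 0.15) would be revised from class 0 to class 1, while a confident sample such as (0.6, 0.3, 0.1) keeps its original prediction.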
Index Terms
- Multimodal Emotion Recognition in Noisy Environment Based on Progressive Label Revision