ABSTRACT
Multimodal emotion recognition has attracted increasing attention in recent decades. Although remarkable progress has been achieved with the rapid development of deep learning, existing methods still struggle with the noise that commonly occurs in practical applications of emotion recognition. To improve the robustness of multimodal emotion recognition, we propose an MLP-based label revision algorithm. The framework consists of three complementary feature extraction networks that were verified in MER2023. An MLP-based attention network with specially designed loss functions is then used to fuse the features from the different modalities. Finally, the output probability of each emotion is used to revise the category that the classifier assigns to each test sample, so that the samples most likely to be affected by noise and misclassified have a chance to be classified correctly. Our best result on the test set of the MER 2023 NOISE subchallenge is an F1-score of 86.35 with a combined metric of 0.6694, which ranks 2nd in the subchallenge.
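The label revision step can be sketched in code. The following is a hypothetical illustration, not the authors' exact algorithm: it assumes per-class output probabilities from the fused classifier and, as one plausible revision rule, reassigns a sample to the runner-up class when the gap between the top two probabilities falls below a margin, i.e. when the prediction is most likely to have been flipped by noise. The function name `revise_labels` and the `margin` parameter are illustrative assumptions.

```python
def revise_labels(prob_rows, margin=0.1):
    """Revise predicted labels for low-confidence samples.

    prob_rows: list of per-class probability lists, one per sample.
    Samples whose top-1 and top-2 probabilities differ by less than
    `margin` are treated as noise-affected and reassigned to the
    runner-up class (an assumed revision rule for illustration).
    """
    revised = []
    for probs in prob_rows:
        # Rank class indices by probability, highest first.
        ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
        top, runner_up = ranked[0], ranked[1]
        # Confident prediction: keep it. Ambiguous one: revise.
        if probs[top] - probs[runner_up] < margin:
            revised.append(runner_up)
        else:
            revised.append(top)
    return revised
```

Under this sketch, a sample with probabilities (0.45, 0.40, 0.15) would be revised from class 0 to class 1, while a confident sample such as (0.6, 0.3, 0.1) keeps its original prediction.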
Index Terms
- Multimodal Emotion Recognition in Noisy Environment Based on Progressive Label Revision