DOI: 10.1145/3581783.3612867
research-article

Multimodal Emotion Recognition in Noisy Environment Based on Progressive Label Revision

Published: 27 October 2023

ABSTRACT

Multimodal emotion recognition has attracted increasing attention in recent decades. Although remarkable progress has been achieved with the rapid development of deep learning, existing methods still struggle with the noise that commonly occurs in practical emotion recognition applications. To improve the robustness of multimodal emotion recognition, we propose an MLP-based label revision algorithm. The framework consists of three complementary feature extraction networks that were verified on MER 2023. An MLP-based attention network with specially designed loss functions then fuses the features from the different modalities. Finally, the output probability of each emotion is used to revise the output category of each sample, correcting the test-set labels produced by the classifier; samples that are most likely to be affected by noise and misclassified thus get a chance to be classified correctly. Our best result on the test set of the MER 2023 NOISE subchallenge is an F1-score of 86.35 with a combined metric of 0.6694, ranking 2nd in the subchallenge.
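The abstract gives only a high-level description of the revision step, so below is a minimal, hypothetical sketch of probability-based label revision in Python. It is not the authors' algorithm: the confidence threshold, the single-pass structure, and the rule of reassigning uncertain samples to the runner-up class are all assumptions made for illustration, and the paper's scheme is progressive rather than single-step.

import numpy as np

def revise_labels(probs, threshold=0.5):
    """Hypothetical single revision step (not the paper's exact rule).

    probs: (n_samples, n_classes) softmax outputs of the fused classifier.
    Samples whose top-class probability falls below `threshold` (an assumed
    criterion) are treated as likely noise-affected and reassigned to the
    second most probable emotion class; confident samples keep their label.
    """
    preds = probs.argmax(axis=1)                  # initial predicted classes
    top_prob = probs.max(axis=1)                  # confidence of the top class
    runner_up = np.argsort(probs, axis=1)[:, -2]  # second most probable class
    uncertain = top_prob < threshold              # candidates for revision
    preds[uncertain] = runner_up[uncertain]       # revise only those samples
    return preds

# Toy usage with three samples and three emotion classes:
probs = np.array([[0.70, 0.20, 0.10],   # confident -> label kept (class 0)
                  [0.40, 0.35, 0.25],   # uncertain -> revised to class 1
                  [0.10, 0.10, 0.80]])  # confident -> label kept (class 2)
print(revise_labels(probs))             # [0 1 2]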


Published in

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023, 9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783
Copyright © 2023 ACM


Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 995 of 4,171 submissions, 24%
