DOI: 10.1145/3607865.3613183

Label Distribution Adaptation for Multimodal Emotion Recognition with Multi-label Learning

Published: 29 October 2023

ABSTRACT

In the task of multimodal emotion recognition with multi-label learning (MER-MULTI), leveraging the correlation between discrete and dimensional emotions is crucial for improving model performance. However, the feature distributions of the training and testing sets may be mismatched, in which case the trained model cannot adapt to the label correlations of the testing set. A significant challenge in MER-MULTI is therefore how to match the feature distributions of the training and testing samples. To tackle this issue, we propose Label Distribution Adaptation for MER-MULTI: by adapting the label distribution between the training and testing sets, we remove training samples whose features do not match the testing set. This improves the model's performance and generalization on testing data, enabling it to better capture the correlations between labels. Furthermore, to ease model training and inference, we design a novel loss function, the Multi-label Emotion Joint Learning (MEJL) loss, which exploits the correlations between discrete and dimensional emotions. Specifically, through contrastive learning, we transform the shared feature distribution of multiple labels into a space in which discrete and dimensional emotions are consistent, which helps the model learn the relationships between them. The proposed method achieved second place in the MER-MULTI task of the MER 2023 Challenge.
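The abstract does not give the concrete adaptation procedure, so the following is only a minimal sketch of the general idea: score each training sample by how close its features lie to the test-set feature distribution and drop the worst matches. The nearest-neighbour distance criterion, the `keep_ratio` parameter, and the function name are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def filter_training_samples(train_feats, test_feats, keep_ratio=0.8):
    """Keep the training samples whose features lie closest to the
    test-set feature distribution (illustrative criterion only).

    train_feats: (N, D) array of training-sample embeddings
    test_feats:  (M, D) array of test-sample embeddings
    keep_ratio:  fraction of training samples to retain (assumed hyperparameter)
    """
    # Distance from each training sample to its nearest test sample:
    # (N, M) pairwise Euclidean distances, reduced over the test axis.
    dists = np.linalg.norm(
        train_feats[:, None, :] - test_feats[None, :, :], axis=-1
    ).min(axis=1)
    # Retain the samples that best match the test distribution.
    n_keep = int(len(train_feats) * keep_ratio)
    return np.argsort(dists)[:n_keep]
```

In a full pipeline, the retained indices would be used to subsample the training set before fitting the multimodal model. Similarly, the exact form of the MEJL loss is not specified in the abstract; the sketch below only illustrates how a discrete-emotion classification term, a dimensional (valence) regression term, and a supervised-contrastive term over the shared features could be combined. The tensor shapes, the equal weighting of the terms, and the use of a SupCon-style contrast keyed on discrete labels are all assumptions.

```python
import torch
import torch.nn.functional as F

def joint_emotion_loss(features, disc_logits, disc_labels,
                       val_pred, val_target, temperature=0.1):
    """Illustrative joint loss over shared multimodal features.

    features:    (B, D) shared embeddings
    disc_logits: (B, C) discrete-emotion logits
    disc_labels: (B,)   discrete-emotion class indices (long)
    val_pred:    (B,)   predicted valence
    val_target:  (B,)   ground-truth valence
    """
    # Discrete-emotion classification term.
    loss_disc = F.cross_entropy(disc_logits, disc_labels)
    # Dimensional-emotion (valence) regression term.
    loss_dim = F.mse_loss(val_pred, val_target)

    # Supervised contrastive term: samples sharing a discrete label are
    # pulled together in the shared feature space.
    z = F.normalize(features, dim=1)
    sim = z @ z.t() / temperature                        # (B, B) similarities
    mask = (disc_labels[:, None] == disc_labels[None, :]).float()
    mask.fill_diagonal_(0)                               # exclude self-pairs
    logits_mask = torch.ones_like(mask).fill_diagonal_(0)
    exp_sim = torch.exp(sim) * logits_mask
    log_prob = sim - torch.log(exp_sim.sum(dim=1, keepdim=True) + 1e-12)
    pos_per_sample = mask.sum(dim=1).clamp(min=1)
    loss_con = (-(mask * log_prob).sum(dim=1) / pos_per_sample).mean()

    return loss_disc + loss_dim + loss_con
```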

Published in

MRAC '23: Proceedings of the 1st International Workshop on Multimodal and Responsible Affective Computing
October 2023, 88 pages
ISBN: 9798400702884
DOI: 10.1145/3607865
Copyright © 2023 ACM

Publisher

Association for Computing Machinery, New York, NY, United States
