ABSTRACT
In multimodal emotion recognition with multi-label learning (MER-MULTI), leveraging the correlation between discrete and dimensional emotions is crucial for improving model performance. However, the feature distributions of the training set and the testing set may not match, so a trained model may fail to adapt to the label correlations present in the testing set. A significant challenge in MER-MULTI is therefore how to match the feature distributions of training and testing samples. To tackle this issue, we propose a method called Label Distribution Adaptation for MER-MULTI. Specifically, we adapt the label distribution between the training set and the testing set by removing training samples whose features do not match the testing set. This enhances the model's performance and generalization on testing data, enabling it to better capture the correlations between labels. Furthermore, to alleviate the difficulty of model training and inference, we design a novel loss function called the Multi-label Emotion Joint Learning (MEJL) loss, which combines the correlations between discrete and dimensional emotions. Through contrastive learning, we transform the shared feature distribution of multiple labels into a space in which discrete and dimensional emotions are consistent, which facilitates the model's learning of the relationships between them. Finally, we evaluated the proposed method, which achieved second place in the MER-MULTI task of the MER 2023 Challenge.
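The contrastive-learning component described above builds on supervised contrastive learning (Khosla et al., 2020), in which samples sharing a label are pulled together in embedding space while others are pushed apart. The sketch below is purely illustrative and is not the paper's MEJL loss: it shows a standard supervised contrastive loss over discrete emotion labels only, with hypothetical inputs (`features` as L2-normalizable embeddings, `labels` as discrete emotion ids); the actual method additionally couples discrete and dimensional emotions.

```python
import numpy as np

def supcon_loss(features, labels, temperature=0.07):
    """Supervised contrastive loss (Khosla et al., 2020), NumPy sketch.

    features: (N, D) embeddings; labels: (N,) discrete emotion ids.
    Samples with the same label are treated as positives for each other.
    """
    # L2-normalize embeddings so dot products are cosine similarities.
    features = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = features @ features.T / temperature
    np.fill_diagonal(sim, -np.inf)  # exclude self-pairs from the softmax

    # Log-softmax over each anchor's similarities to all other samples.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))

    # Positives: other samples with the same discrete label.
    pos_mask = (labels[:, None] == labels[None, :]) & ~np.eye(len(labels), dtype=bool)
    pos_counts = pos_mask.sum(axis=1)
    valid = pos_counts > 0  # anchors with at least one positive

    # Mean negative log-likelihood of positives, averaged over valid anchors.
    per_anchor = -np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    return (per_anchor[valid] / pos_counts[valid]).mean()
```

With correct labels the loss is lower than with shuffled labels, since same-emotion embeddings cluster and the positives dominate each anchor's softmax; in the full method, a similar mechanism encourages embeddings to be consistent across discrete and dimensional annotations.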