Speech emotion recognition (SER) classifies speech into emotion categories such as happy, angry, sad, and neutral. Recently, deep learning has been applied to the SER task. This paper proposes a multi-task learning (MTL) framework that simultaneously performs speech-to-text recognition and emotion classification, using an end-to-end deep neural model based on wav2vec-2.0. Experiments on the IEMOCAP benchmark show that the proposed method achieves state-of-the-art performance on the SER task. In addition, an ablation study establishes the effectiveness of the proposed MTL framework.
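The abstract does not say how the two task objectives are combined. As a minimal sketch, assuming a weighted sum (a common choice in multi-task learning, not confirmed by the text above), the training loss for one utterance could look like the following. The names `multitask_loss`, `asr_loss`, and `alpha`, along with the default weight, are hypothetical; `asr_loss` stands in for whatever speech-to-text loss the shared model produces.

```python
import math

# Emotion categories named in the abstract.
EMOTIONS = ["happy", "angry", "sad", "neutral"]


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one utterance, computed stably via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]


def multitask_loss(emotion_logits, emotion_label, asr_loss, alpha=0.1):
    """Hypothetical combined objective: emotion loss plus alpha times the ASR loss.

    The weighted-sum form and the default alpha are assumptions for illustration.
    """
    emotion_loss = cross_entropy(emotion_logits, EMOTIONS.index(emotion_label))
    return emotion_loss + alpha * asr_loss


# Example: uninformative emotion logits and a placeholder ASR loss of 2.0.
loss = multitask_loss([0.0, 0.0, 0.0, 0.0], "sad", asr_loss=2.0, alpha=0.5)
```

Under this assumed form, setting `alpha` to 0 reduces training to emotion classification alone, which is the kind of comparison an ablation study of the MTL framework would make.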
Cite as: Cai, X., Yuan, J., Zheng, R., Huang, L., Church, K. (2021) Speech Emotion Recognition with Multi-Task Learning. Proc. Interspeech 2021, 4508-4512, doi: 10.21437/Interspeech.2021-1852
@inproceedings{cai21b_interspeech,
  author={Xingyu Cai and Jiahong Yuan and Renjie Zheng and Liang Huang and Kenneth Church},
  title={{Speech Emotion Recognition with Multi-Task Learning}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={4508--4512},
  doi={10.21437/Interspeech.2021-1852}
}