Speech Emotion Recognition with Multi-Task Learning

Cai, Xingyu; Yuan, Jiahong; Zheng, Renjie; Huang, Liang; Church, Kenneth

doi:10.21437/Interspeech.2021-1852

Speech emotion recognition (SER) classifies speech into emotion categories such as Happy, Angry, Sad, and Neutral. Recently, deep learning has been applied to the SER task. This paper proposes a multi-task learning (MTL) framework that simultaneously performs speech-to-text recognition and emotion classification, using an end-to-end deep neural model based on wav2vec-2.0. Experiments on the IEMOCAP benchmark show that the proposed method achieves state-of-the-art performance on the SER task. In addition, an ablation study establishes the effectiveness of the proposed MTL framework.
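
As a rough illustration of the multi-task idea, the sketch below combines an emotion classification loss with a weighted speech-to-text loss into a single training objective. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify how the two losses are combined, so the weighted sum, the weight `alpha`, the function names, and the precomputed `asr_loss` argument are all hypothetical.

```python
import math

# Emotion categories named in the abstract.
EMOTIONS = ["Happy", "Angry", "Sad", "Neutral"]


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for a single example, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]


def multitask_loss(emotion_logits, emotion_label, asr_loss, alpha=0.1):
    """Hypothetical joint objective: emotion loss plus a weighted ASR loss.

    asr_loss is assumed to be computed elsewhere by the speech-to-text
    head; alpha is an assumed weighting hyperparameter, not a value
    taken from the paper.
    """
    target = EMOTIONS.index(emotion_label)
    return cross_entropy(emotion_logits, target) + alpha * asr_loss


# Example: uniform emotion logits and an ASR loss of 2.0.
loss = multitask_loss([0.0, 0.0, 0.0, 0.0], "Sad", asr_loss=2.0, alpha=0.5)
```

In this sketch, setting `alpha` to zero reduces the objective to emotion classification alone, which is the kind of single-task baseline an ablation would compare against.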

Cite as: Cai, X., Yuan, J., Zheng, R., Huang, L., Church, K. (2021) Speech Emotion Recognition with Multi-Task Learning. Proc. Interspeech 2021, 4508-4512, doi: 10.21437/Interspeech.2021-1852

@inproceedings{cai21b_interspeech,
  author={Xingyu Cai and Jiahong Yuan and Renjie Zheng and Liang Huang and Kenneth Church},
  title={{Speech Emotion Recognition with Multi-Task Learning}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={4508--4512},
  doi={10.21437/Interspeech.2021-1852}
}