An End-to-end Multitask Learning Model to Improve Speech Emotion Recognition


Abstract:

In this paper, we propose an attention-based CNN-BLSTM model trained with the end-to-end (E2E) learning method. We first extract Mel-spectrograms from the wav files instead of using handcrafted features. Then we adopt two types of attention mechanisms to let the model focus on emotionally salient periods of speech along the temporal dimension. Because people differ considerably in how they express emotions, we incorporate speaker recognition as an auxiliary task. Moreover, since the training set is small, we add data from another language as data augmentation. We evaluated the proposed method on the SAVEE dataset under single-task, multitask, and cross-language training. The evaluation shows that the proposed model achieves 73.62% weighted accuracy and 71.11% unweighted accuracy on speech emotion recognition, outperforming the baseline by 11.13 points.
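The abstract reports both weighted and unweighted accuracy without defining them. The sketch below follows the convention common in speech emotion recognition work, which is an assumption here since the abstract does not state the paper's exact definitions: weighted accuracy is the overall fraction of correctly classified utterances, and unweighted accuracy is the mean of the per-class recalls. The function names and the toy labels are hypothetical and are not taken from the paper or from SAVEE.

```python
def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct utterances divided by all utterances."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so each emotion class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        # Predictions made for the utterances whose true label is c.
        preds_for_c = [p for t, p in zip(y_true, y_pred) if t == c]
        recalls.append(sum(p == c for p in preds_for_c) / len(preds_for_c))
    return sum(recalls) / len(recalls)


# Made-up, class-imbalanced labels (illustrative only, not real results).
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 0, 2, 2]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(f"WA = {wa:.4f}, UA = {ua:.4f}")
```

On imbalanced data the two metrics diverge: a mistake in a small class lowers unweighted accuracy more than weighted accuracy, which is why both are usually reported together.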
Date of Conference: 18-21 January 2021
Date Added to IEEE Xplore: 18 December 2020
Conference Location: Amsterdam, Netherlands
