
Speech Emotion Recognition Using Capsule Networks



Abstract:

Speech emotion recognition (SER) is a fundamental step towards fluent human-machine interaction. One challenging problem in SER is obtaining an utterance-level feature representation for classification. Recent work on SER has made significant progress by using spectrogram features and introducing neural network methods, e.g., convolutional neural networks (CNNs). However, a fundamental problem of CNNs is that they do not capture the spatial information in spectrograms, i.e., the position and relationship information of low-level features such as pitch and formant frequencies. This paper presents a novel architecture based on capsule networks (CapsNets) for SER. The proposed system takes into account the spatial relationships of speech features in spectrograms and provides an effective pooling method for obtaining utterance-level global features. We also introduce a recurrent connection to CapsNets to improve the model's sensitivity to temporal structure. We compare the proposed model to previously published results based on combined CNN-long short-term memory (CNN-LSTM) models on the benchmark IEMOCAP corpus over four emotions, i.e., neutral, angry, happy, and sad. Experimental results show that our model outperforms the baseline system on weighted accuracy (WA) (72.73% vs. 68.8%) and unweighted accuracy (UA) (59.71% vs. 59.4%), demonstrating the effectiveness of CapsNets for SER.
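The capsule mechanism the abstract refers to can be illustrated with a minimal NumPy sketch of routing-by-agreement, the standard CapsNet pooling procedure (Sabour et al., 2017). This is an assumption-laden illustration, not the paper's exact recurrent variant: the shapes, iteration count `n_iters`, and function names `squash` / `dynamic_routing` are all hypothetical choices made here for clarity.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # Squash nonlinearity: scales a vector so short vectors shrink toward 0
    # and long vectors approach (but never reach) unit length.
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, n_iters=3):
    # u_hat: prediction vectors from lower-level capsules,
    #        shape (n_in, n_out, d_out).
    # Returns the n_out output capsule vectors, shape (n_out, d_out).
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))  # routing logits, start uniform
    for _ in range(n_iters):
        # Softmax over output capsules: how strongly each input routes upward.
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
        # Weighted sum of predictions, then squash -> candidate outputs.
        s = (c[..., None] * u_hat).sum(axis=0)
        v = squash(s)
        # Increase logits where predictions agree with the output (dot product).
        b = b + (u_hat * v[None]).sum(axis=-1)
    return v

# Toy usage: 8 input capsules predicting 4 output capsules of dimension 16,
# loosely analogous to pooling frame-level spectrogram capsules into
# utterance-level emotion capsules.
rng = np.random.default_rng(0)
u_hat = rng.normal(size=(8, 4, 16))
v = dynamic_routing(u_hat)
```

The agreement-based update is what lets capsules preserve positional relationships that max pooling in a plain CNN discards; each output vector's length can then be read as the probability that the corresponding emotion class is present.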
Date of Conference: 12-17 May 2019
Date Added to IEEE Xplore: 17 April 2019

Conference Location: Brighton, UK

