Speaker-Independent Speech Emotion Recognition Based on CNN-BLSTM and Multiple SVMs

Liu, Zhen-Tao; Xiao, Peng; Li, Dan-Yun; Hao, Man

doi:10.1007/978-3-030-27535-8_43

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11742)

Included in the following conference series:

International Conference on Intelligent Robotics and Applications

Abstract

Speaker-independent speech emotion recognition (SER) is a complex task because speakers differ in gender, age, and other emotion-irrelevant factors, which can cause large differences in the distribution of emotional features. To alleviate the adverse effect of these emotion-irrelevant factors, we propose an SER model that consists of convolutional neural networks (CNN), an attention-based bidirectional long short-term memory network (BLSTM), and multiple linear support vector machines. The log Mel-spectrogram with its velocity (delta) and acceleration (double delta) coefficients is adopted as the input of our model because it supplies sufficient information for feature learning. Several groups of speaker-independent SER experiments are performed on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database to improve the credibility of the results. Experimental results show that our method obtains an unweighted average recall of 61.50% and a weighted average recall of 62.31% for speaker-independent SER on the IEMOCAP database.
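The two reported metrics are easy to confuse, so a minimal sketch of how they are commonly computed may help. It assumes the usual SER definitions: unweighted average recall (UAR) is the plain mean of the per-class recalls, and weighted average recall (WAR) weights each class recall by its number of utterances, which equals overall accuracy. The function name `recall_scores`, the class labels, and the toy predictions are illustrative assumptions and are not taken from the paper or from IEMOCAP.

```python
from collections import Counter


def recall_scores(y_true, y_pred):
    """Return (UAR, WAR) for two equal-length label sequences.

    Assumed definitions (standard in SER, not quoted from the paper):
      UAR = mean of per-class recalls, every class counted equally
      WAR = per-class recalls weighted by class size (overall accuracy)
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty sequences of equal length")

    support = Counter(y_true)  # utterances per true class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class

    per_class_recall = {c: hits[c] / n for c, n in support.items()}
    uar = sum(per_class_recall.values()) / len(per_class_recall)
    war = sum(hits.values()) / len(y_true)
    return uar, war


# Toy, hand-made labels (hypothetical; NOT IEMOCAP data).
y_true = ["ang", "ang", "ang", "ang", "hap", "hap", "neu", "sad"]
y_pred = ["ang", "ang", "ang", "hap", "hap", "neu", "neu", "ang"]

uar, war = recall_scores(y_true, y_pred)
print(f"UAR = {uar:.4f}, WAR = {war:.4f}")
```

On imbalanced data the two values diverge: in the toy example the majority class is recognized well, so WAR comes out higher than UAR, which penalizes the completely missed minority class equally.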

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants 61403422, 61703375 and 61273102, the Hubei Provincial Natural Science Foundation of China under Grants 2018CFB447 and 2015CFA010, the Wuhan Science and Technology Project under Grant 2017010201010133, the 111 project under Grant B17040, and the Fundamental Research Funds for National University, China University of Geosciences (Wuhan) under Grant 1810491T07.

Author information

Authors and Affiliations

School of Automation, China University of Geosciences, Wuhan, 430074, Hubei, China
Zhen-Tao Liu, Peng Xiao, Dan-Yun Li & Man Hao
Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan, 430074, Hubei, China
Zhen-Tao Liu, Peng Xiao, Dan-Yun Li & Man Hao

Corresponding author

Correspondence to Zhen-Tao Liu.

Editor information

Editors and Affiliations

Shenyang Institute of Automation, Shenyang, China
Haibin Yu
Shenyang Institute of Automation, Shenyang, China
Jinguo Liu
Shenyang Institute of Automation, Shenyang, China
Lianqing Liu
University of Portsmouth, Portsmouth, UK
Zhaojie Ju
Shenyang Institute of Automation, Shenyang, China
Yuwang Liu
University of Portsmouth, Portsmouth, UK
Dalin Zhou

About this paper

Cite this paper

Liu, ZT., Xiao, P., Li, DY., Hao, M. (2019). Speaker-Independent Speech Emotion Recognition Based on CNN-BLSTM and Multiple SVMs. In: Yu, H., Liu, J., Liu, L., Ju, Z., Liu, Y., Zhou, D. (eds) Intelligent Robotics and Applications. ICIRA 2019. Lecture Notes in Computer Science, vol 11742. Springer, Cham. https://doi.org/10.1007/978-3-030-27535-8_43

DOI: https://doi.org/10.1007/978-3-030-27535-8_43
Published: 02 August 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27534-1
Online ISBN: 978-3-030-27535-8
eBook Packages: Computer Science, Computer Science (R0)
