Speaker-Independent Speech Emotion Recognition Based on CNN-BLSTM and Multiple SVMs

  • Conference paper
Intelligent Robotics and Applications (ICIRA 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11742)

Abstract

Speaker-independent speech emotion recognition (SER) is a complex task because speakers vary in gender, age, and other emotion-irrelevant factors, which can cause large differences in the distribution of emotional features. To alleviate the adverse effect of these emotion-irrelevant factors, we propose an SER model that consists of convolutional neural networks (CNN), an attention-based bidirectional long short-term memory network (BLSTM), and multiple linear support vector machines (SVMs). The log Mel-spectrogram with its velocity (delta) and acceleration (double delta) coefficients is adopted as the input of our model, since these features supply sufficient information for feature learning. Several groups of speaker-independent SER experiments are performed on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database to improve the credibility of the results. Experimental results show that our method obtains an unweighted average recall of 61.50% and a weighted average recall of 62.31% for speaker-independent SER on the IEMOCAP database.
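As a rough illustration of two quantities named in the abstract, the sketch below stacks a log Mel-spectrogram with its delta and double-delta coefficients into a three-channel input, and computes the weighted and unweighted average recall. It is not the authors' implementation: the function names (`delta`, `stack_static_delta`, `war_uar`), the regression window `N=2`, the array sizes, and the random placeholder standing in for a real log Mel-spectrogram are all assumptions made for the example. Weighted average recall is taken here as overall accuracy, and unweighted average recall as the mean of the per-class recalls.

```python
import numpy as np


def delta(feat, N=2):
    """Regression-based delta over time for a (frames, bands) array.

    The half-width N=2 is an assumption, not a value taken from the paper.
    """
    feat = np.asarray(feat, dtype=float)
    T = feat.shape[0]
    # Repeat the first and last frames so every frame has N neighbours.
    padded = np.pad(feat, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = np.zeros_like(feat)
    for n in range(1, N + 1):
        # n * (c[t + n] - c[t - n]) for every frame t at once.
        out += n * (padded[N + n:N + n + T] - padded[N - n:N - n + T])
    return out / denom


def stack_static_delta(log_mel):
    """Return a (frames, bands, 3) array: static, delta, double delta."""
    d1 = delta(log_mel)
    d2 = delta(d1)
    return np.stack([np.asarray(log_mel, dtype=float), d1, d2], axis=-1)


def war_uar(y_true, y_pred):
    """Weighted average recall (accuracy) and unweighted average recall."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Recall of each class that occurs in the ground truth.
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    war = float(np.mean(y_pred == y_true))
    uar = float(np.mean(recalls))
    return war, uar


# Placeholder input: 100 frames x 40 Mel bands of random values.
rng = np.random.default_rng(0)
log_mel = rng.normal(size=(100, 40))
features = stack_static_delta(log_mel)
print(features.shape)  # (100, 40, 3)

# Imbalanced toy labels: four samples of class 0, two of class 1.
war, uar = war_uar([0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 1, 0])
print(round(war, 4), round(uar, 4))  # 0.8333 0.75
```

In the toy labels the two measures differ because the classes are imbalanced: the majority class dominates the weighted figure, while the unweighted figure gives each class equal weight.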

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants 61403422, 61703375 and 61273102, the Hubei Provincial Natural Science Foundation of China under Grants 2018CFB447 and 2015CFA010, the Wuhan Science and Technology Project under Grant 2017010201010133, the 111 project under Grant B17040, and the Fundamental Research Funds for National University, China University of Geosciences (Wuhan) under Grant 1810491T07.

Author information

Corresponding author

Correspondence to Zhen-Tao Liu.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Liu, ZT., Xiao, P., Li, DY., Hao, M. (2019). Speaker-Independent Speech Emotion Recognition Based on CNN-BLSTM and Multiple SVMs. In: Yu, H., Liu, J., Liu, L., Ju, Z., Liu, Y., Zhou, D. (eds) Intelligent Robotics and Applications. ICIRA 2019. Lecture Notes in Computer Science, vol 11742. Springer, Cham. https://doi.org/10.1007/978-3-030-27535-8_43

  • DOI: https://doi.org/10.1007/978-3-030-27535-8_43

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-27534-1

  • Online ISBN: 978-3-030-27535-8

  • eBook Packages: Computer Science, Computer Science (R0)
