
Attention and Feature Selection for Automatic Speech Emotion Recognition Using Utterance and Syllable-Level Prosodic Features


Abstract

This work recognizes emotions from human speech using prosodic information represented by variations in duration, energy, and fundamental frequency (\(F_{0}\)). The speech signal is first automatically segmented into syllables. Using the syllable boundaries, prosodic features are extracted at the utterance level (15 features) and the syllable level (10 features) and used to train separate deep neural network classifiers. The effectiveness of the proposed approach is demonstrated on the German speech corpus EMOTional Sensitivity ASsistance System (EmotAsS) for people with disabilities, the dataset used for the Interspeech 2018 Atypical Affect Sub-Challenge. The initial set of prosodic features yields an unweighted average recall (UAR) of 30.15% on evaluation. Fusing the decision scores of these features with those of spectral features gives a UAR of 36.71%. This paper also employs an attention mechanism and feature selection using resampling-based recursive feature elimination (RFE) to enhance system performance. Applying attention and feature selection followed by score-level fusion improves the UAR to 36.83% for prosodic features and 40.96% for the overall fusion. Fusing the scores of the best individual system of the Atypical Affect Sub-Challenge with those of the proposed system provides a UAR of 43.71%, above the best reported test result. The effectiveness of the proposed system is also demonstrated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, with a UAR of 63.83%.
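
To make the evaluation protocol concrete, the sketch below illustrates weighted score-level fusion of two subsystems followed by the unweighted average recall (UAR) metric. This is a minimal sketch under stated assumptions, not the authors' implementation: the posterior-score arrays, the fusion weight `alpha`, and the function names are illustrative.

```python
# Minimal sketch (not the authors' code): score-level fusion of two
# classifiers and the unweighted average recall (UAR) metric.
# Assumes each subsystem outputs per-utterance posterior scores of shape
# (num_utterances, num_classes); the fusion weight `alpha` is illustrative.
import numpy as np

def fuse_scores(prosodic_scores, spectral_scores, alpha=0.5):
    """Weighted sum of the posterior scores of the two subsystems."""
    return alpha * prosodic_scores + (1.0 - alpha) * spectral_scores

def unweighted_average_recall(y_true, y_pred, num_classes):
    """UAR: mean of per-class recalls, so every emotion class counts equally."""
    recalls = []
    for c in range(num_classes):
        mask = (y_true == c)
        if mask.any():
            recalls.append(np.mean(y_pred[mask] == c))
    return float(np.mean(recalls))

# Toy usage with random scores for a 4-class task
rng = np.random.default_rng(0)
y_true = rng.integers(0, 4, size=200)
prosodic = rng.random((200, 4))
spectral = rng.random((200, 4))
fused = fuse_scores(prosodic, spectral, alpha=0.4)
y_pred = fused.argmax(axis=1)
print("UAR:", unweighted_average_recall(y_true, y_pred, num_classes=4))
```

UAR rather than plain accuracy is the challenge metric because the emotion classes are imbalanced; averaging per-class recalls prevents a majority class from dominating the score.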


Acknowledgements

The authors would like to thank the organizers of the Interspeech 2018 Atypical Affect Sub-Challenge for providing the decision scores of the openSMILE ComParE baseline system.

Author information

Corresponding author

Correspondence to Starlet Ben Alex.

Ethics declarations

Conflict of interest

The authors wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Alex, S.B., Mary, L. & Babu, B.P. Attention and Feature Selection for Automatic Speech Emotion Recognition Using Utterance and Syllable-Level Prosodic Features. Circuits Syst Signal Process 39, 5681–5709 (2020). https://doi.org/10.1007/s00034-020-01429-3
