
Deep learning approaches for speech emotion recognition: state of the art and research challenges

Published in Multimedia Tools and Applications

A Correction to this article was published on 01 May 2021

This article has been updated

Abstract

Speech emotion recognition (SER) systems identify emotions from the human voice in areas such as smart healthcare, driving assistance, call centers, automatic translation, and human-machine interaction. In the classical SER pipeline, discriminative acoustic feature extraction is the most important and challenging step, because discriminative features improve classifier performance and reduce computational time. Nonetheless, current handcrafted acoustic features offer limited capability and accuracy for building SER systems suitable for real-time deployment. To overcome these limitations, a variety of deep learning techniques have in recent years been proposed and employed for automatic feature extraction in emotion prediction from speech signals. However, to the best of our knowledge, no in-depth review is available that critically appraises and summarizes the existing deep learning techniques for SER together with their strengths and weaknesses. Hence, this study presents a comprehensive review of deep learning techniques for SER, covering their uniqueness, benefits, and limitations. It also presents speech processing techniques, performance measures, and publicly available emotional speech databases, and discusses the significance of the findings of the primary studies. Finally, it presents open research issues and challenges that require significant research effort and enhancement in the field of SER systems.
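To make the contrast the abstract draws concrete, the following minimal Python sketch extracts classical handcrafted features (utterance-level MFCC statistics) and, alternatively, feeds a log-mel spectrogram to a small convolutional network that learns its own representation. This sketch is not from the article: librosa and TensorFlow are assumed to be installed, and a synthetic one-second tone stands in for a real emotional-speech utterance.

```python
# Minimal sketch (not from the article): handcrafted acoustic features
# versus a learned deep representation for SER. Assumes librosa and
# TensorFlow are installed; a synthetic tone replaces a real utterance.
import numpy as np
import librosa
import tensorflow as tf

sr = 16000
t = np.linspace(0, 1.0, sr, endpoint=False)
y = (0.5 * np.sin(2 * np.pi * 220 * t)).astype(np.float32)  # stand-in utterance

# Classical SER front end: handcrafted MFCCs pooled into one
# fixed-length vector per utterance for a conventional classifier.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)                   # (13, frames)
handcrafted = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])  # (26,)

# Deep learning front end: a small CNN learns features directly from
# a log-mel spectrogram instead of relying on hand engineering.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel)[np.newaxis, :, :, np.newaxis]     # (1, 64, frames, 1)

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu",
                           input_shape=log_mel.shape[1:]),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(4, activation="softmax"),  # e.g. 4 emotion classes
])
print(handcrafted.shape, model(log_mel).shape)       # (26,) (1, 4)
```

In the handcrafted path, the discriminative power is fixed by the chosen descriptors; in the learned path, the convolutional filters are optimized jointly with the classifier, which is the shift the surveyed deep learning approaches exploit.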



Author information

Corresponding authors

Correspondence to Rashid Jahangir or Ying Wah Teh.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Change history

The original online version of this article was revised: the table footer of Table 5 had to be removed, as it should be part of the main text, and the authors’ biographies and photos had been left out.


About this article


Cite this article

Jahangir, R., Teh, Y.W., Hanif, F. et al. Deep learning approaches for speech emotion recognition: state of the art and research challenges. Multimed Tools Appl 80, 23745–23812 (2021). https://doi.org/10.1007/s11042-020-09874-7

