Deep features-based speech emotion recognition for smart affective services

Published in: Multimedia Tools and Applications

Abstract

Emotion recognition from speech signals is an active research area with several applications, including smart healthcare, automated voice response systems, assessment of situational seriousness through analysis of a caller's affective state in emergency centers, and other smart affective services. In this paper, we present a study of speech emotion recognition based on features extracted from spectrograms using a deep convolutional neural network (CNN) with rectangular kernels. Typically, CNNs use square kernels and pooling operators at various layers, which are well suited for 2D image data. In spectrograms, however, the information is encoded differently: time is represented along the x-axis, the y-axis shows the frequency of the speech signal, and amplitude is indicated by the intensity value at a given position. To analyze speech through spectrograms, we propose rectangular kernels of varying shapes and sizes, along with max pooling over rectangular neighborhoods, to extract discriminative features. The proposed scheme effectively learns discriminative features from speech spectrograms and outperforms many state-of-the-art techniques when evaluated on Emo-DB and a Korean speech dataset.
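To make the spectrogram encoding and the rectangular-kernel idea concrete, here is a minimal sketch in Python using SciPy and PyTorch. This is not the authors' published architecture: the layer count, the specific kernel shapes (a tall 9x3 kernel spanning frequency versus a wide 3x9 kernel spanning time), the pooling windows, and the seven-class output are all illustrative assumptions.

```python
# Minimal sketch: log-spectrogram extraction plus a CNN with rectangular
# kernels and rectangular max pooling. All layer shapes and hyperparameters
# are illustrative assumptions, not the authors' published configuration.
import numpy as np
from scipy import signal
import torch
import torch.nn as nn

def speech_to_spectrogram(waveform: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    """Convert a 1-D speech signal into a log-magnitude spectrogram.

    Time runs along one axis and frequency along the other, with amplitude
    encoded as intensity, matching the representation described above.
    """
    freqs, times, spec = signal.spectrogram(waveform, fs=sample_rate,
                                            nperseg=256, noverlap=128)
    return np.log(spec + 1e-10)  # log scale compresses the dynamic range

class RectKernelCNN(nn.Module):
    """CNN whose kernels and pooling windows are rectangular rather than
    square, so each layer covers different extents in frequency vs. time."""

    def __init__(self, num_emotions: int = 7):
        super().__init__()
        self.features = nn.Sequential(
            # Tall kernel: wide in frequency, narrow in time.
            nn.Conv2d(1, 16, kernel_size=(9, 3), padding=(4, 1)),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(4, 2)),   # rectangular pooling window
            # Wide kernel: narrow in frequency, wide in time.
            nn.Conv2d(16, 32, kernel_size=(3, 9), padding=(1, 4)),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 4)),
            nn.AdaptiveAvgPool2d((4, 4)),       # fixed-size feature map
        )
        self.classifier = nn.Linear(32 * 4 * 4, num_emotions)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, freq_bins, time_frames)
        h = self.features(x)
        return self.classifier(h.flatten(1))

# Usage: one second of dummy audio -> spectrogram -> emotion logits.
wave = np.random.randn(16000).astype(np.float32)
spec = speech_to_spectrogram(wave)
x = torch.from_numpy(spec).float().unsqueeze(0).unsqueeze(0)
logits = RectKernelCNN()(x)
print(logits.shape)  # torch.Size([1, 7])
```

The intuition behind the rectangular shapes: a kernel that is tall in the frequency dimension can capture harmonic structure at a single instant, while a kernel that is wide in the time dimension can capture temporal dynamics of a narrow frequency band, which square kernels treat symmetrically.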



Acknowledgements

This work was supported by Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. R0126-15-1119, Development of a solution for situation-awareness based on the analysis of speech and environmental sounds).

Author information


Corresponding author

Correspondence to Sung Wook Baik.


About this article


Cite this article

Badshah, A.M., Rahim, N., Ullah, N. et al. Deep features-based speech emotion recognition for smart affective services. Multimed Tools Appl 78, 5571–5589 (2019). https://doi.org/10.1007/s11042-017-5292-7
