
Modified dense convolutional networks based emotion detection from speech using its paralinguistic features

Published in: Multimedia Tools and Applications

Abstract

Emotion recognition from speech is fundamental to human interaction: speech modulations convey different emotions and contexts. In this paper, we propose modified dense convolutional networks (modified DenseNet201) for emotion detection from speech using its paralinguistic features, such as vocal tract features. The proposed network classifies emotions from speech using spectrograms of the audio files. It outperforms alternative models such as residual networks, AlexNet, VGG16, SVM, XGBoost, and boosted random forests for emotion classification from speech, surpasses the existing methods proposed in the literature, and obtains state-of-the-art results in most cases. Further, the proposed network has been successfully validated on datasets in two different languages, ‘EmoDB’ (German) and ‘SAVEE’ (English), which qualifies it as a language-independent system for emotion detection from speech.
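To make the described pipeline concrete, the sketch below shows one plausible implementation in Python: a speech clip is converted to a log-mel spectrogram with librosa and fed to an ImageNet-pretrained DenseNet201 whose classifier head is replaced to match the emotion classes. The abstract does not spell out the paper's exact architectural modifications, so the replaced head, the 16 kHz sampling rate, the 224×224 spectrogram size, the seven-class output, and the example file name are all illustrative assumptions rather than the authors' exact method.

```python
# Minimal sketch: speech clip -> log-mel spectrogram -> DenseNet201 classifier.
# All hyper-parameters below are assumptions, not the paper's exact settings.

import librosa
import numpy as np
import torch
import torch.nn as nn
from torchvision import models

NUM_EMOTIONS = 7  # e.g. EmoDB's seven emotion classes (assumed)

def speech_to_spectrogram(wav_path, n_mels=224, target_frames=224):
    """Load an audio file and return a 3-channel log-mel spectrogram tensor."""
    y, sr = librosa.load(wav_path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    # Scale to [0, 1] and pad/crop the time axis to a fixed width
    # (simplified; ImageNet mean/std normalization is omitted here).
    log_mel = (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-8)
    if log_mel.shape[1] < target_frames:
        log_mel = np.pad(log_mel, ((0, 0), (0, target_frames - log_mel.shape[1])))
    else:
        log_mel = log_mel[:, :target_frames]
    x = torch.tensor(log_mel, dtype=torch.float32)
    return x.unsqueeze(0).repeat(3, 1, 1)  # DenseNet expects 3 input channels

def build_emotion_densenet(num_classes=NUM_EMOTIONS):
    """ImageNet-pretrained DenseNet201 with a new classification head."""
    net = models.densenet201(weights=models.DenseNet201_Weights.DEFAULT)
    net.classifier = nn.Linear(net.classifier.in_features, num_classes)
    return net

if __name__ == "__main__":
    model = build_emotion_densenet().eval()
    spec = speech_to_spectrogram("03a01Fa.wav")  # hypothetical EmoDB utterance
    with torch.no_grad():
        logits = model(spec.unsqueeze(0))  # add batch dimension
    print(logits.softmax(dim=1))
```

In this reading, "modified DenseNet201" amounts at minimum to swapping the 1000-way ImageNet head for an emotion-class head and fine-tuning on spectrogram images; any further changes to the dense blocks would need the full paper.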




Author information

Corresponding author

Correspondence to Varun Gupta.



About this article


Cite this article

Dhiman, R., Kang, G.S. & Gupta, V. Modified dense convolutional networks based emotion detection from speech using its paralinguistic features. Multimed Tools Appl 80, 32041–32069 (2021). https://doi.org/10.1007/s11042-021-11210-6

