Abstract
Emotion recognition from speech is fundamental to human interaction: modulations of the voice convey both emotion and context. In this paper, we propose a modified dense convolutional network (modified DenseNet201) for detecting emotion from speech using its paralinguistic features, such as vocal-tract characteristics. The proposed network classifies emotions from spectrograms computed from the audio files. It outperforms alternative models such as residual networks, AlexNet, VGG16, SVM, XGBoost, and boosted random forests on emotion classification from speech. Moreover, it surpasses existing methods reported in the literature, achieving state-of-the-art results in most cases. Finally, the network has been validated on datasets in two different languages, ‘EmoDB’ (German) and ‘SAVEE’ (English), which qualifies it as a language-independent speech emotion detection system.
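As a minimal sketch of the preprocessing step the abstract describes (converting an audio waveform into a spectrogram before classification), the following computes a magnitude spectrogram via a short-time Fourier transform. This is an illustrative NumPy implementation, not the authors' pipeline; the frame length, hop size, and Hann window are assumed parameters, and in practice an audio library such as librosa would typically produce the (often mel-scaled) spectrogram image fed to the CNN.

```python
import numpy as np

def spectrogram(signal, frame_len=512, hop=256):
    """Magnitude spectrogram: window the signal into overlapping
    frames and take the one-sided FFT of each frame."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Shape: (n_frames, frame_len // 2 + 1)
    return np.abs(np.fft.rfft(frames, axis=1))

# Toy example: a 440 Hz tone sampled at 16 kHz for one second
sr = 16000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (61, 257)
```

Each row of `spec` is one time frame and each column one frequency bin; rendered as an image (typically on a log-amplitude scale), this 2-D representation is what a DenseNet-style classifier consumes.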
Dhiman, R., Kang, G.S. & Gupta, V. Modified dense convolutional networks based emotion detection from speech using its paralinguistic features. Multimed Tools Appl 80, 32041–32069 (2021). https://doi.org/10.1007/s11042-021-11210-6