Abstract
Emotion recognition from speech is fundamental to human interaction: modulations of the voice convey both emotion and context. In this paper, we propose a modified dense convolutional network (modified DenseNet201) for detecting emotion from speech using its paralinguistic features, such as vocal-tract characteristics. The proposed network classifies emotions from spectrograms computed from the audio files. It outperforms alternative models such as residual networks, AlexNet, VGG16, SVM, XGBoost, and boosted random forests on emotion classification from speech. Moreover, it surpasses existing methods reported in the literature, achieving state-of-the-art results in most cases. Finally, the network has been validated on datasets in two different languages, ‘EmoDB’ (German) and ‘SAVEE’ (English), which qualifies it as a language-independent speech emotion detection system.
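As a minimal sketch of the preprocessing step the abstract describes (converting an audio waveform into a spectrogram before classification), the following computes a magnitude spectrogram via a short-time Fourier transform. This is an illustrative NumPy implementation, not the authors' pipeline; the frame length, hop size, and Hann window are assumed parameters, and in practice an audio library such as librosa would typically produce the (often mel-scaled) spectrogram image fed to the CNN.

```python
import numpy as np

def spectrogram(signal, frame_len=512, hop=256):
    """Magnitude spectrogram: window the signal into overlapping
    frames and take the one-sided FFT of each frame."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Shape: (n_frames, frame_len // 2 + 1)
    return np.abs(np.fft.rfft(frames, axis=1))

# Toy example: a 440 Hz tone sampled at 16 kHz for one second
sr = 16000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (61, 257)
```

Each row of `spec` is one time frame and each column one frequency bin; rendered as an image (typically on a log-amplitude scale), this 2-D representation is what a DenseNet-style classifier consumes.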
Dhiman, R., Kang, G.S. & Gupta, V. Modified dense convolutional networks based emotion detection from speech using its paralinguistic features. Multimed Tools Appl 80, 32041–32069 (2021). https://doi.org/10.1007/s11042-021-11210-6