Abstract
This paper presents the analysis and classification of speech spectrograms for recognizing emotions in the RAVDESS dataset. Features are extracted from speech utterances using Mel-Frequency Cepstral Coefficients (MFCC). Deep neural networks (DNNs) are then employed to classify speech into six emotions (happy, sad, neutral, calm, disgust, and fear). First, the paper presents a comprehensive comparative study of DNNs on prosodic features, reporting the outcomes of all models. Second, it analyzes a Bag of Visual Words (BoVW) approach that extracts speeded-up robust features (SURF), clusters them using K-means, and classifies them into the aforementioned emotions using a support vector machine (SVM). Of the five DNNs deployed, (i) a Long Short-Term Memory (LSTM) network on MFCC and (ii) a Multi-Layer Perceptron (MLP) classifier on MFCC outperform the others, each giving an accuracy score of 0.70. The BoVW technique achieves 53% correct classification. Therefore, the proposed methodology constructs a Hybrid of Acoustic Features (HAF) and feeds it into an ensemble of bagged multi-layer perceptron classifiers, yielding an accuracy of 85% and a precision score between 0.77 and 0.88 for the classification of the six emotions.
References
Gupta V, Juyal S, Singh GP, Killa C, Gupta N (2020) Emotion recognition of audio/speech data using deep learning approaches. J Inf Optim Sci 41(6):1309–1317
Wang H, Wei S, Fang B (2020) Facial expression recognition using iterative fusion of MO-HOG and deep features. J Supercomput 76:3211–3221
Kommineni J, Mandala S, Sunar MS, Chakravarthy PM (2021) Accurate computing of facial expression recognition using a hybrid feature extraction technique. J Supercomput 77:5019–5044
Do LN, Yang HJ, Nguyen HD, Kim SH, Lee GS, Na IS (2021) Deep neural network-based fusion model for emotion recognition using visual data. J Supercomput. https://doi.org/10.1007/s11227-021-03690-y
Gupta V, Singh VK, Mukhija P, Ghose U (2019) Aspect-based sentiment analysis of mobile reviews. J Intell Fuzzy Syst 36(5):4721–4730
Jain N, Gupta V, Shubham S, Madan A, Chaudhary A, Santosh KC (2021) Understanding cartoon emotion using integrated deep neural network on large dataset. Neural Comput Appl. https://doi.org/10.1007/s00521-021-06003-9
Wang K, An N, Li BN, Zhang Y, Li L (2015) Speech emotion recognition using Fourier parameters. IEEE Trans Affect Comput 6(1):69–75
Xiao Z, Dellandrea E, Dou W, Chen L (2005) Features extraction and selection for emotional speech classification. In: IEEE Conference on Advanced Video and Signal Based Surveillance (AVSS 2005), pp. 411–416
Dave N (2013) Feature extraction methods LPC, PLP and MFCC in speech recognition. Int J Adv Res Eng Technol 1(6):1–4
Abrilian S, Devillers L, Buisine S, Martin JC (2005) EmoTV1: annotation of real-life emotions for the specification of multimodal affective interfaces. In: 11th International Conference on Human-Computer Interaction (HCI 2005) pp. 195–200
Smith H, Schneider A (2009) Critiquing models of emotions. Sociol Methods Res 37(4):560–589
Rao KS, Yegnanarayana B (2006) Prosody modification using instants of significant excitation. IEEE Trans Audio Speech Lang Process 14(3):972–980
Yegnanarayana B, Veldhuis RN (1998) Extraction of vocal-tract system characteristics from speech signals. IEEE Trans Speech Audio Process 6(4):313–327
Fernandez R, Picard RW (2002) Dialog act classification from prosodic features using SVMs. In: Speech prosody 2002, International conference
Makhoul J (1975) Linear prediction: a tutorial review. Proc IEEE 63(4):561–580
Koolagudi SG, Rao KS (2010) Real life emotion classification using VOP and pitch based spectral features. In: 2010 Annual IEEE India Conference (INDICON). IEEE, pp. 1–4
Mao Q, Dong M, Huang Z, Zhan Y (2014) Learning salient features for SER using CNNs. IEEE Trans Multimed 16(8):2203–2213
Tomba K, Dumoulin J, Mugellini E, Khaled OA, Hawila S (2018) Stress detection through speech analysis. In: ICETE (1) pp. 560–564
Mao Q, Xue W, Rao Q, Zhang F, Zhan Y (2016) Domain adaptation for SER by sharing priors between related source and target classes. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 2608–2612
Livingstone SR, Russo FA (2018) The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5)
Alshamsi H, Kepuska V, Alshamsi H, Meng H (2018) Automated facial expression and SER app development on smart phones using cloud computing. In: 2018 IEEE 9th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON). IEEE, pp. 730–738
Hossan MA, Memon S, Gregory MA (2010) A novel approach for MFCC feature extraction. In: 2010 4th International Conference on Signal Processing and Communication Systems. IEEE, pp. 1–5
Kwok HK, Jones DL (2000) Improved instantaneous frequency estimation using an adaptive short-time Fourier transform. IEEE Trans Signal Process 48(10):2964–2972
Wold S, Esbensen K, Geladi P (1987) Principal component analysis. Chemom Intell Lab Syst 2(1–3):37–52
Lim W, Jang D, Lee T (2016) SER using convolutional and recurrent neural networks. In: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA). IEEE, pp. 1–4
Zheng WQ, Yu JS, Zou YX (2015) An experimental study of speech emotion recognition based on deep convolutional neural networks. In: 2015 International Conference on Affective Computing and Intelligent Interaction (ACII). IEEE, pp. 827–831
Zhao JF, Mao X, Chen L (2019) Speech emotion recognition using deep 1D & 2D CNN LSTM networks. Biomed Signal Process Control 47:312–323
Dey R, Salem FM (2017) Gate-variants of Gated Recurrent Unit (GRU) neural networks. In: 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS)
Spyrou E, Nikopoulou R, Vernikos I, Mylonas P (2019) Emotion recognition from speech using the bag-of-visual words on audio segment spectrograms. Technologies 7(1):20
Bay H, Ess A, Tuytelaars T, Van Gool L (2008) Speeded-up robust features (SURF). Comput Vis Image Underst 110:346–359
Tang H, Meng CH, Lee LS (2010) An initial attempt for phoneme recognition using Structured Support Vector Machine (SVM). In: 2010 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, pp. 4926–4929
Deng L, Acero A, Plumpe M, Huang X (2000) Large-vocabulary speech recognition under adverse acoustic environments. In: Sixth International Conference on Spoken Language Processing.
Schuller B, Müller R, Lang M, Rigoll G (2005) Speaker independent emotion recognition by early fusion of acoustic and linguistic features within ensembles. In: Ninth European Conference on Speech Communication and Technology
Juyal S, Killa C, Singh GP, Gupta N, Gupta V (2021) Emotion recognition from speech using deep neural network. In: Srivastava S, Khari M, Gonzalez CR, Chaudhary G, Arora P (eds) Concepts and real-time applications of deep learning. EAI/Springer innovations in communication and computing Springer, Cham
Pao TL, Chen YT, Yeh JH, Li PJ (2006) Mandarin emotional speech recognition based on SVM and NN. In: 18th International Conference on Pattern Recognition (ICPR'06). IEEE, pp. 1096–1100
Cen L, Ser W, Yu ZL (2008) SER using canonical correlation analysis and probabilistic neural network. In: 2008 Seventh International Conference on Machine Learning and Applications. IEEE, pp. 859–862
Lika RA, Seldon HL, Kiong LC (2014) Feature analysis of speech emotion data on arousal-valence dimension using adaptive neuro-fuzzy classifier. In: 2014 International conference on Industrial Automation, Information and Communications Technology. IEEE, pp. 104–110
Zhang B, Essl G, Provost EM (2015) Recognizing emotion from singing and speaking using shared models. In: 2015 International Conference on Affective Computing and Intelligent Interaction (ACII). IEEE, pp. 139–145
Bertero D, Fung P (2017) A first look into a CNN for speech emotion detection. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 5115–5119
Cite this article
Gupta, V., Juyal, S. & Hu, YC. Understanding human emotions through speech spectrograms using deep neural network. J Supercomput 78, 6944–6973 (2022). https://doi.org/10.1007/s11227-021-04124-5