Abstract
This work applies deep learning to recognize and categorize emotions conveyed through speech. The research used audio recordings from several datasets: the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Berlin emotional speech database, and a self-developed Telugu dataset. The main contribution is a deep neural network-based model that categorizes emotional reactions elicited by spoken monologues in various situations, recognizing eight distinct emotions: neutral, calm, happy, sad, angry, fearful, disgusted, and surprised. Model performance was evaluated with the F1 score, a measure that combines precision and recall. The model achieved a weighted average F1 score of 0.91 on the test set and performed best on the "Angry" class with a score of 0.95. Its performance on the "Sad" class was lower, at 0.87, yet still exceeds state-of-the-art results. The contribution is an effective model for recognizing emotional reactions conveyed through spoken language, using neural networks and a combination of datasets to improve the understanding of emotions in speech.
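The weighted-average F1 reported above can be sketched in plain Python. This is an illustration only, not the authors' evaluation code; the per-class counts below are hypothetical and chosen so the two reported class scores (0.95 and 0.87) and the weighted average (0.91) fall out of the arithmetic:

```python
def f1(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical per-class confusion counts:
# class -> (true positives, false positives, false negatives, support)
counts = {
    "angry": (95, 5, 5, 100),
    "sad":   (87, 13, 13, 100),
}

def weighted_f1(counts):
    """Average of per-class F1 scores, weighted by class support."""
    total = sum(support for *_, support in counts.values())
    score = 0.0
    for tp, fp, fn, support in counts.values():
        p = tp / (tp + fp)   # precision
        r = tp / (tp + fn)   # recall
        score += f1(p, r) * support / total
    return score
```

With these symmetric counts, per-class precision and recall coincide, so F1("angry") = 0.95, F1("sad") = 0.87, and the support-weighted average is 0.91. In practice a library routine such as scikit-learn's `f1_score` with `average="weighted"` computes the same quantity over all eight classes.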
Acknowledgements
We thank all the volunteers who helped us in making the Telugu database. Presently the database is under review with the committee for endorsement and will be publicly available. The RAVDESS dataset is available at https://www.kaggle.com/datasets/uwrfkaggler/ravdess-emotional-speech-audio.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare. All co-authors have seen and agree with the contents of the manuscript, and there is no financial interest to report. We certify that this submission is original work and is not under review at any other publication.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
Cite this article
Rao, A.S., Reddy, A.P., Vulpala, P. et al. Deep learning structure for emotion prediction using MFCC from native languages. Int J Speech Technol 26, 721–733 (2023). https://doi.org/10.1007/s10772-023-10047-8