Abstract
This work applies deep learning to recognize and categorize emotions conveyed through speech. The research used audio recordings from several datasets: the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Berlin emotional speech database, and a self-developed Telugu dataset. The main contribution is a deep neural network-based model that categorizes emotional reactions elicited by spoken monologues in various situations, recognizing eight distinct emotions: neutral, calm, happy, sad, angry, fearful, disgusted, and surprised. Model performance was evaluated with the F1 score, a measure that combines precision and recall. The model achieved a weighted average F1 score of 0.91 on the test set and performed best on the "Angry" class with a score of 0.95. Its performance on the "Sad" class was lower, at 0.87, yet still exceeds state-of-the-art results. The contribution is an effective model for recognizing emotional reactions conveyed through spoken language, using neural networks and a combination of datasets to improve the understanding of emotions in speech.
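The weighted-average F1 reported above can be sketched in plain Python. This is an illustration only, not the authors' evaluation code; the per-class counts below are hypothetical and chosen so the two reported class scores (0.95 and 0.87) and the weighted average (0.91) fall out of the arithmetic:

```python
def f1(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical per-class confusion counts:
# class -> (true positives, false positives, false negatives, support)
counts = {
    "angry": (95, 5, 5, 100),
    "sad":   (87, 13, 13, 100),
}

def weighted_f1(counts):
    """Average of per-class F1 scores, weighted by class support."""
    total = sum(support for *_, support in counts.values())
    score = 0.0
    for tp, fp, fn, support in counts.values():
        p = tp / (tp + fp)   # precision
        r = tp / (tp + fn)   # recall
        score += f1(p, r) * support / total
    return score
```

With these symmetric counts, per-class precision and recall coincide, so F1("angry") = 0.95, F1("sad") = 0.87, and the support-weighted average is 0.91. In practice a library routine such as scikit-learn's `f1_score` with `average="weighted"` computes the same quantity over all eight classes.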
Acknowledgements
We thank all the volunteers who helped us in making the Telugu database. Presently the database is under review with the committee for endorsement and will be publicly available. The RAVDESS dataset is available at https://www.kaggle.com/datasets/uwrfkaggler/ravdess-emotional-speech-audio.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare. All co-authors have seen and agree with the contents of the manuscript, and there is no financial interest to report. We certify that this submission is original work and is not under review at any other publication.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
Cite this article
Rao, A.S., Reddy, A.P., Vulpala, P. et al. Deep learning structure for emotion prediction using MFCC from native languages. Int J Speech Technol 26, 721–733 (2023). https://doi.org/10.1007/s10772-023-10047-8