
Deep learning structure for emotion prediction using MFCC from native languages

International Journal of Speech Technology

Abstract

AI has transformed speech processing, enabling systems to recognize and categorize the emotions conveyed in spoken language. This research used audio recordings from several datasets: the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Berlin emotional speech database, and a self-developed Telugu dataset. The main contribution is a deep neural network-based model that categorizes emotional reactions elicited by spoken monologues in various situations, with the goal of recognizing eight distinct emotions: neutral, calm, happy, sad, angry, fearful, disgusted, and surprised. Model performance was evaluated with the F1 score, a measure that combines precision and recall. The model achieved a weighted average F1 score of 0.91 on the test set and performed best on the "Angry" class with a score of 0.95. Its performance on the "Sad" class was lower, at 0.87, which still exceeds state-of-the-art results. Overall, the work contributes an effective model for recognizing emotions conveyed through spoken language, combining neural networks with multiple datasets to improve the understanding of emotion in speech.
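As an illustration of the pipeline the abstract describes, the sketch below extracts MFCC features from a recording with librosa and scores predictions with the weighted F1 metric reported above. The file path, feature dimensionality, and the simple MLP classifier are assumptions for demonstration only; they do not reproduce the paper's exact deep network.

    # Illustrative sketch only: MFCC feature extraction and weighted-F1 scoring.
    # The audio path, label set, and MLP classifier are assumptions for
    # demonstration; they do not reproduce the paper's exact architecture.
    import librosa
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import f1_score

    EMOTIONS = ["neutral", "calm", "happy", "sad",
                "angry", "fearful", "disgusted", "surprised"]

    def mfcc_features(path, n_mfcc=40):
        # Load a recording at its native sample rate and return a
        # fixed-length vector: the MFCCs averaged over time frames.
        signal, sr = librosa.load(path, sr=None)
        mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
        return mfcc.mean(axis=1)

    # With feature matrices X_train/X_test and label arrays y_train/y_test
    # built by applying mfcc_features to each recording:
    #   clf = MLPClassifier(hidden_layer_sizes=(256, 128), max_iter=500)
    #   clf.fit(X_train, y_train)
    #   print(f1_score(y_test, clf.predict(X_test), average="weighted"))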



Notes

  1. https://www.who.int/news-room/fact-sheets/detail/depression.

  2. https://smartlaboratory.org/ravdess/.



Acknowledgements

We thank all the volunteers who helped us build the Telugu database. The database is currently under committee review for endorsement and will be made publicly available. The RAVDESS dataset is available at https://www.kaggle.com/datasets/uwrfkaggler/ravdess-emotional-speech-audio.
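For readers who want to fetch the RAVDESS audio from the Kaggle page above, a minimal sketch using the kaggle Python package follows; the target directory is an arbitrary choice, and a configured Kaggle API token is assumed.

    # Sketch: download the RAVDESS speech audio from Kaggle.
    # Assumes a Kaggle API token (~/.kaggle/kaggle.json) is already configured;
    # the target directory "data/ravdess" is an arbitrary choice.
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()
    api.dataset_download_files(
        "uwrfkaggler/ravdess-emotional-speech-audio",
        path="data/ravdess",
        unzip=True,
    )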

Author information


Corresponding author

Correspondence to A. Pramod Reddy.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare. All co-authors have seen and agreed to the contents of the manuscript, and there are no financial interests to report. We certify that the submission is original work and is not under review at any other publication.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Rao, A.S., Reddy, A.P., Vulpala, P. et al. Deep learning structure for emotion prediction using MFCC from native languages. Int J Speech Technol 26, 721–733 (2023). https://doi.org/10.1007/s10772-023-10047-8



