
Speech-based emotion recognition using a hybrid RNN-CNN network

  • Original Paper
  • Published:
Signal, Image and Video Processing

Abstract

Speech emotion recognition (SER) is among the most active areas of speech signal analysis, aiming to estimate and classify the rich spectrum of emotions conveyed by speakers. This paper develops a novel deep learning (DL)-based model for detecting speech emotion variation that addresses several weaknesses of existing data-driven approaches. A new DL architecture, referred to as the RNN-CNN, is proposed and applied to the SER task, operating directly on raw speech signals. A key design choice is an initial convolution layer with a wide kernel, used as an efficient way to mitigate the noise present in raw speech signals. Three databases are used to evaluate the proposed RNN-CNN model: RML, RAVDESS, and SAVEE. On each dataset, the model achieves improved accuracy rates over the previous works reviewed. This assessment validates the robust performance and applicability of the proposed model across diverse speech databases and underlines its potential for speech-based emotion recognition.
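The full text is behind the subscription wall, so the exact layer configuration is not reproduced here. As a point of reference only, the sketch below shows one plausible shape for the idea the abstract describes: a wide-kernel convolutional front end on the raw waveform, followed by narrower convolutions and a recurrent stage. All layer sizes, kernel widths, and the emotion-class count are illustrative assumptions, not the authors' published architecture.

```python
# Hypothetical sketch of a wide-kernel CNN front end feeding an RNN,
# in the spirit of the RNN-CNN described in the abstract. Every
# hyperparameter below is an assumption for illustration.
import torch
import torch.nn as nn

class RNNCNNSketch(nn.Module):
    def __init__(self, n_classes: int = 8):
        super().__init__()
        # Wide first kernel: a large receptive field on the raw waveform
        # is a common way to suppress high-frequency noise early.
        self.frontend = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=512, stride=16, padding=256),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.MaxPool1d(4),
        )
        # Narrower convolutions refine local features.
        self.conv = nn.Sequential(
            nn.Conv1d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.MaxPool1d(4),
        )
        # A recurrent layer models the temporal evolution of emotion cues.
        self.rnn = nn.LSTM(128, 128, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 128, n_classes)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, 1, samples) raw speech signal
        x = self.conv(self.frontend(wav))      # (batch, 128, frames)
        x, _ = self.rnn(x.transpose(1, 2))     # (batch, frames, 256)
        return self.classifier(x.mean(dim=1))  # average over time

# Example: a batch of four 3-second clips at 16 kHz
logits = RNNCNNSketch()(torch.randn(4, 1, 48000))
print(logits.shape)  # torch.Size([4, 8])
```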



Availability of data and materials

No datasets were generated or analysed during the current study.


Acknowledgements

No individuals or organizations require acknowledgment for contributions to this work.

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Author information


Contributions

JN performed data collection, simulation, and analysis. WZ reviewed the first draft of the manuscript and contributed to editing and writing.

Corresponding author

Correspondence to Jingtao Ning.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethical approval

The research paper has received ethical approval from the institutional review board, ensuring the protection of participants' rights and compliance with the relevant ethical guidelines.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Ning, J., Zhang, W. Speech-based emotion recognition using a hybrid RNN-CNN network. SIViP 19, 124 (2025). https://doi.org/10.1007/s11760-024-03574-7

