Abstract
Speech emotion recognition (SER) is among the most active areas of speech signal analysis, as it enables the estimation and classification of the rich spectrum of emotions conveyed by speakers. This paper develops a novel deep learning (DL) model for detecting speech emotion variation that overcomes several weaknesses of existing intelligent data-driven approaches. A new DL architecture, referred to as the RNN–CNN, is proposed and applied to the SER task by operating directly on raw speech signals. The central challenge was to effectively combine an initial convolutional layer with a wide kernel, which serves as an efficient means of mitigating the noise present in raw speech signals. In the experimental analysis, three databases are used to evaluate the proposed RNN–CNN model: RML, RAVDESS, and SAVEE. The model achieves higher accuracy rates than the previous works analyzed on the respective datasets. This assessment validates the robust performance and applicability of the proposed model across diverse speech databases and underlines its potential for further speech-based emotion recognition.
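To make the described architecture concrete, below is a minimal PyTorch sketch of a wide-kernel convolutional front end followed by a recurrent stage, operating directly on raw waveforms. It illustrates the general RNN–CNN idea only: the kernel width (1024 samples, about 64 ms at 16 kHz), strides, channel counts, LSTM size, and six-class output are assumed values for illustration, not the configuration published in the paper.

```python
# Minimal sketch of a wide-kernel CNN + RNN classifier for raw speech.
# All hyperparameters below (kernel width 1024, stride 256, channel counts,
# LSTM size, 6 output classes) are illustrative assumptions, not the
# authors' published configuration.
import torch
import torch.nn as nn

class WideKernelRNNCNN(nn.Module):
    def __init__(self, n_classes: int = 6):
        super().__init__()
        # A wide first kernel spans tens of milliseconds of waveform,
        # acting as a learned, noise-robust filter bank on the raw signal.
        self.frontend = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=1024, stride=256, padding=512),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        # The recurrent stage models temporal dynamics of the conv features.
        self.rnn = nn.LSTM(input_size=128, hidden_size=128,
                           num_layers=1, batch_first=True)
        self.head = nn.Linear(128, n_classes)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) raw audio
        x = self.frontend(wav.unsqueeze(1))   # (batch, 128, frames)
        x = x.transpose(1, 2)                 # (batch, frames, 128)
        _, (h_n, _) = self.rnn(x)             # final hidden state
        return self.head(h_n[-1])             # (batch, n_classes) logits

# Example: one second of 16 kHz audio for a batch of two utterances.
logits = WideKernelRNNCNN()(torch.randn(2, 16000))
print(logits.shape)  # torch.Size([2, 6])
```

The intuition behind the wide first kernel is that it covers tens of milliseconds of signal and can learn a filter-bank-like decomposition, which tends to be less sensitive to sample-level noise than a stack of narrow kernels applied directly to the raw waveform.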
Availability of data and materials
No datasets were generated or analysed during the current study.
Acknowledgements
No individuals or organizations require acknowledgment for their contributions to this work.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Author information
Contributions
JN performed data collection, simulation, and analysis. WZ evaluated the first draft of the manuscript and contributed to editing and writing.
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethical approval
The research paper has received ethical approval from the institutional review board, ensuring the protection of participants' rights and compliance with the relevant ethical guidelines.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ning, J., Zhang, W. Speech-based emotion recognition using a hybrid RNN-CNN network. SIViP 19, 124 (2025). https://doi.org/10.1007/s11760-024-03574-7
Received:
Revised:
Accepted:
Published:
DOI: https://doi.org/10.1007/s11760-024-03574-7