Abstract
Speech emotion recognition (SER) is among the most active areas of speech signal analysis, as it enables the estimation and classification of the rich spectrum of emotions conveyed by speakers. This paper develops a novel deep learning (DL) model for detecting speech emotion variation that overcomes several weaknesses of existing intelligent data-driven approaches. A new DL architecture, referred to as the RNN–CNN, is proposed and applied to the SER task by operating directly on raw speech signals. The central challenge was to effectively combine an initial convolutional layer with a wide kernel, which serves as an efficient means of mitigating the noise present in raw speech signals. In the experimental analysis, three databases are used to evaluate the proposed RNN–CNN model: RML, RAVDESS, and SAVEE. The model achieves higher accuracy rates than the previous works analyzed on the respective datasets. This assessment validates the robust performance and applicability of the proposed model across diverse speech databases and underlines its potential for further speech-based emotion recognition.
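To make the described architecture concrete, below is a minimal PyTorch sketch of a wide-kernel convolutional front end followed by a recurrent stage, operating directly on raw waveforms. It illustrates the general RNN–CNN idea only: the kernel width (1024 samples, about 64 ms at 16 kHz), strides, channel counts, LSTM size, and six-class output are assumed values for illustration, not the configuration published in the paper.

```python
# Minimal sketch of a wide-kernel CNN + RNN classifier for raw speech.
# All hyperparameters below (kernel width 1024, stride 256, channel counts,
# LSTM size, 6 output classes) are illustrative assumptions, not the
# authors' published configuration.
import torch
import torch.nn as nn

class WideKernelRNNCNN(nn.Module):
    def __init__(self, n_classes: int = 6):
        super().__init__()
        # A wide first kernel spans tens of milliseconds of waveform,
        # acting as a learned, noise-robust filter bank on the raw signal.
        self.frontend = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=1024, stride=256, padding=512),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        # The recurrent stage models temporal dynamics of the conv features.
        self.rnn = nn.LSTM(input_size=128, hidden_size=128,
                           num_layers=1, batch_first=True)
        self.head = nn.Linear(128, n_classes)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) raw audio
        x = self.frontend(wav.unsqueeze(1))   # (batch, 128, frames)
        x = x.transpose(1, 2)                 # (batch, frames, 128)
        _, (h_n, _) = self.rnn(x)             # final hidden state
        return self.head(h_n[-1])             # (batch, n_classes) logits

# Example: one second of 16 kHz audio for a batch of two utterances.
logits = WideKernelRNNCNN()(torch.randn(2, 16000))
print(logits.shape)  # torch.Size([2, 6])
```

The intuition behind the wide first kernel is that it covers tens of milliseconds of signal and can learn a filter-bank-like decomposition, which tends to be less sensitive to sample-level noise than a stack of narrow kernels applied directly to the raw waveform.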
Availability of data and materials
No datasets were generated or analysed during the current study.
Acknowledgements
No individuals or organizations require acknowledgment for their contributions to this work.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Author information
Contributions
JN performed data collection, simulation, and analysis. WZ evaluated the first draft of the manuscript and contributed to editing and writing.
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethical approval
The research paper has received ethical approval from the institutional review board, ensuring the protection of participants' rights and compliance with the relevant ethical guidelines.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ning, J., Zhang, W. Speech-based emotion recognition using a hybrid RNN-CNN network. SIViP 19, 124 (2025). https://doi.org/10.1007/s11760-024-03574-7
Received:
Revised:
Accepted:
Published:
DOI: https://doi.org/10.1007/s11760-024-03574-7