Towards an efficient backbone for preserving features in speech emotion recognition: deep-shallow convolution with recurrent neural network

  • Original Article
  • Published:
Neural Computing and Applications

Abstract

Speech emotion recognition (SER) has attracted a great deal of research interest, as it plays a critical role in human-machine interactions. Unlike other visual tasks, SER becomes intractable when convolutional neural networks (CNNs) are employed, owing to their limitations in handling log-mel spectrograms. It is therefore useful to establish a feature-extraction backbone that allows CNNs to maintain the information integrity of speech utterances when utilizing log-mel spectrograms. Moreover, a neural network with a deep stack of layers can suffer performance degradation due to various challenges, including information loss, overfitting, and vanishing gradients. Many studies employ hybrid/multi-modal methods or specialized network designs to mitigate these obstacles; however, such methods are often unstable, hard to configure, and non-adaptive to different tasks. In this research, we propose a reusable backbone of CNN blocks for SER tasks, inspired by the FishNet model. Denoted deep-shallow convolution with RNN (DSCRNN), the proposed backbone preserves features from both deep and shallow layers, which is effective in improving the quality of features extracted from log-mel spectrograms. Simulation results indicate that our proposed DSCRNN backbone improves accuracy by 2% and 11% over a baseline model with traditional CNN blocks in a speaker-independent evaluation on the RAVDESS dataset with 4 classes and 8 classes, respectively.

Data availability

The code associated with the paper is available at https://github.com/devpriyagoel/deep-shallow-convolution-with-recurrent-neural-network.

Code availability

Yes.

References

  1. Abdullah SMSA, Ameen SYA, Sadeeq MA, Zeebaree S (2021) Multimodal emotion recognition using deep learning. J Appl Sci Technol Trends 2(02):52–58

  2. Bänziger T, Scherer KR (2005) The role of intonation in emotional expressions. Speech Commun 46(3–4):252–267

  3. Bechara A, Damasio H, Damasio AR (2000) Emotion, decision making and the orbitofrontal cortex. Cereb Cortex 10(3):295–307

  4. Breazeal C (2002) Regulation and entrainment in human–robot interaction. Int J Robot Res 21(10–11):883–902. https://doi.org/10.1177/0278364902021010096

  5. Cen L, Wu F, Yu ZL, Hu F (2016) A real-time speech emotion recognition system and its application in online learning. In: Emotions, technology, design, and learning. Elsevier, pp 27–46

  6. Chen L, Mao X, Xue Y, Cheng LL (2012) Speech emotion recognition: features and classification models. Digit Signal Process 22(6):1154–1160. https://doi.org/10.1016/j.dsp.2012.05.007

  7. Chen M, He X, Yang J, Zhang H (2018) 3D convolutional recurrent neural networks with attention model for speech emotion recognition. IEEE Signal Process Lett 25(10):1440–1444. https://doi.org/10.1109/LSP.2018.2860246

  8. Cowie R (2009) Perceiving emotion: towards a realistic understanding of the task. Philos Trans R Soc Lond Ser B Biol Sci 364:3515–3525. https://doi.org/10.1098/rstb.2009.0139

  9. Cowie R, Douglas-Cowie E, Tsapatsoulis N, Votsis G, Kollias S, Fellenz W, Taylor JG (2001) Emotion recognition in human–computer interaction. IEEE Signal Process Mag 18(1):32–80

  10. El Ayadi M, Kamel MS, Karray F (2011) Survey on speech emotion recognition: features, classification schemes, and databases. Pattern Recognit 44(3):572–587. https://doi.org/10.1016/j.patcog.2010.09.020

  11. El Ayadi MMH, Kamel MS, Karray F (2007) Speech emotion recognition using Gaussian mixture vector autoregressive models. In: 2007 IEEE international conference on acoustics, speech and signal processing (ICASSP), vol 4, pp IV-957–IV-960

  12. Giannopoulos P, Perikos I, Hatzilygeroudis I (2018) Deep learning approaches for facial emotion recognition: a case study on FER-2013. In: Advances in hybridization of intelligent methods. Springer, pp 1–16

  13. Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: 2018 IEEE/CVF conference on computer vision and pattern recognition, pp 7132–7141. https://doi.org/10.1109/CVPR.2018.00745

  14. Ingale AB, Chaudhari D (2012) Speech emotion recognition. Int J Soft Comput Eng (IJSCE) 2(1):235–238

  15. Jalal M, Loweimi E, Moore R, Hain T (2019) Learning temporal clusters using capsule routing for speech emotion recognition. In: Interspeech 2019, pp 1701–1705. https://doi.org/10.21437/Interspeech.2019-3068

  16. Jones C, Sutherland J (2008) Acoustic emotion recognition for affective computer gaming. In: Affect and emotion in human–computer interaction. Springer, pp 209–219

  17. Lee C, Narayanan S, Pieraccini R (2002) Classifying emotions in human-machine spoken dialogs. In: Proceedings of the IEEE international conference on multimedia and expo (ICME), vol 1, pp 737–740. https://doi.org/10.1109/ICME.2002.1035887

  18. Lee J, Tashev I (2015) High-level feature representation using recurrent neural network for speech emotion recognition. In: Interspeech 2015. ISCA

  19. Lim W, Jang D, Lee T (2016) Speech emotion recognition using convolutional and recurrent neural networks. In: 2016 Asia-Pacific signal and information processing association annual summit and conference (APSIPA), pp 1–4

  20. Livingstone SR, Russo FA (2018) The Ryerson audio-visual database of emotional speech and song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5):e0196391

  21. Mao X, Chen L, Fu L (2009) Multi-level speech emotion recognition based on HMM and ANN. In: 2009 WRI World congress on computer science and information engineering, vol 7, pp 225–229

  22. Meng H, Yan T, Yuan F, Wei H (2019) Speech emotion recognition from 3D log-mel spectrograms with deep learning network. IEEE Access 7:125868–125881. https://doi.org/10.1109/ACCESS.2019.2938007

  23. Nwe T, Foo S, De Silva L (2003) Speech emotion recognition using hidden Markov models. Speech Commun 41:603–623. https://doi.org/10.1016/S0167-6393(03)00099-2

  24. Osawa H, Orszulak J, Godfrey KM, Coughlin JF (2010) Maintaining learning motivation of older people by combining household appliance with a communication robot. In: 2010 IEEE/RSJ international conference on intelligent robots and systems, pp 5310–5316

  25. Petrushin V (1999) Emotion in speech: recognition and application to call centers. In: Proceedings of artificial neural networks in engineering, vol 710, p 22

  26. Ranganathan H, Chakraborty S, Panchanathan S (2016) Multimodal emotion recognition using deep learning architectures. In: 2016 IEEE winter conference on applications of computer vision (WACV), pp 1–9

  27. Ren M, Nie W, Liu A, Su Y (2019) Multi-modal correlated network for emotion recognition in speech. Vis Inform 3(3):150–155

  28. Rozgic V, Ananthakrishnan S, Saleem S, Kumar R, Vembu A, Prasad R (2012) Emotion recognition using acoustic and lexical features. In: 13th annual conference of the international speech communication association, INTERSPEECH 2012, vol 1

  29. Schuller B, Rigoll G, Lang M (2003) Hidden Markov model-based speech emotion recognition. In: 2003 international conference on multimedia and expo (ICME ’03), vol 1, pp I-401. https://doi.org/10.1109/ICME.2003.1220939

  30. Schuller B, Rigoll G, Lang M (2004) Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture. In: 2004 IEEE international conference on acoustics, speech, and signal processing, vol 1, pp I–577

  31. Schuller B, Vlasenko B, Eyben F, Rigoll G, Wendemuth A (2009) Acoustic emotion recognition: a benchmark comparison of performances. In: 2009 IEEE workshop on automatic speech recognition and understanding, pp 552–557

  32. Song P, Jin Y, Zha C, Zhao L (2015) Speech emotion recognition method based on hidden factor analysis. Electron Lett 51(1):112–114

  33. Sun S, Pang J, Shi J, Yi S, Ouyang W (2019) FishNet: a versatile backbone for image, region, and pixel level prediction. arXiv preprint arXiv:1901.03495

  34. Tokuno S, Tsumatori G, Shono S, Takei E, Yamamoto T, Suzuki G, Shimura M (2011) Usage of emotion recognition in military health care. In: 2011 defense science research conference and expo (DSR), pp 1–5

  35. Zeng H, Wu Z, Zhang J, Yang C, Zhang H, Dai G, Kong W (2019) EEG emotion classification using an improved SincNet-based deep learning model. Brain Sci. https://doi.org/10.3390/brainsci9110326

  36. Zhang Q, Chen X, Zhan Q, Yang T, Xia S (2017) Respiration-based emotion recognition with deep learning. Comput Ind 92:84–90

  37. Zhao Z, Zheng Y, Zhang Z, Wang H, Zhao Y, Li C (2018) Exploring spatio-temporal representations by integrating attention-based bidirectional-LSTM-RNNs and FCNs for speech emotion recognition. In: Interspeech 2018

  38. Zheng WQ, Yu JS, Zou YX (2015) An experimental study of speech emotion recognition based on deep convolutional neural networks. In: 2015 international conference on affective computing and intelligent interaction (ACII), pp 827–831. https://doi.org/10.1109/ACII.2015.7344669

Funding

Not applicable.

Author information

Authors and Affiliations

Authors

Contributions

DPG and KM implemented the algorithm and wrote the paper. NDN, NS, and CPL provided guidance and revised the paper.

Corresponding author

Correspondence to Natesan Srinivasan.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Consent for publication

Yes.

Consent to participate

Yes.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Goel, D.P., Mahajan, K., Nguyen, N.D. et al. Towards an efficient backbone for preserving features in speech emotion recognition: deep-shallow convolution with recurrent neural network. Neural Comput & Applic 35, 2457–2469 (2023). https://doi.org/10.1007/s00521-022-07723-2
