Towards an efficient backbone for preserving features in speech emotion recognition: deep-shallow convolution with recurrent neural network

  • Original Article
  • Published:
Neural Computing and Applications

Abstract

Speech emotion recognition (SER) has attracted a great deal of research interest, as it plays a critical role in human-machine interactions. Unlike other visual tasks, SER becomes intractable when convolutional neural networks (CNNs) are employed, owing to their limitations in handling log-mel spectrograms. It is therefore useful to establish a feature-extraction backbone that allows CNNs to maintain the information integrity of speech utterances when utilizing log-mel spectrograms. Moreover, a neural network with a deep stack of layers can suffer performance degradation due to various challenges, including information loss, overfitting, and vanishing gradients. Many studies employ hybrid/multi-modal methods or specialized network designs to mitigate these obstacles; however, such methods are often unstable, hard to configure, and non-adaptive to different tasks. In this research, we propose a reusable backbone of CNN blocks for SER tasks, inspired by the FishNet model. Denoted deep-shallow convolution with RNN (DSCRNN), the proposed backbone preserves features from both deep and shallow layers, which is effective in improving the quality of features extracted from log-mel spectrograms. Simulation results indicate that our proposed DSCRNN backbone improves accuracy by 2% and 11% over a baseline model with traditional CNN blocks in a speaker-independent evaluation on the RAVDESS dataset with 4 classes and 8 classes, respectively.

Data availability

The code associated with the paper is available at https://github.com/devpriyagoel/deep-shallow-convolution-with-recurrent-neural-network.

Code availability

Yes.

References

  1. Abdullah SMSA, Ameen SYA, Sadeeq MA, Zeebaree S (2021) Multimodal emotion recognition using deep learning. J Appl Sci Technol Trends 2(02):52–58

  2. Bänziger T, Scherer KR (2005) The role of intonation in emotional expressions. Speech Commun 46(3–4):252–267

  3. Bechara A, Damasio H, Damasio AR (2000) Emotion, decision making and the orbitofrontal cortex. Cereb Cortex 10(3):295–307

  4. Breazeal C (2002) Regulation and entrainment in human–robot interaction. Int J Robot Res 21(10–11):883–902. https://doi.org/10.1177/0278364902021010096

  5. Cen L, Wu F, Yu ZL, Hu F (2016) A real-time speech emotion recognition system and its application in online learning. In: Emotions, technology, design, and learning. Elsevier, pp 27–46

  6. Chen L, Mao X, Xue Y, Cheng LL (2012) Speech emotion recognition: features and classification models. Digit Signal Process 22(6):1154–1160. https://doi.org/10.1016/j.dsp.2012.05.007

  7. Chen M, He X, Yang J, Zhang H (2018) 3D convolutional recurrent neural networks with attention model for speech emotion recognition. IEEE Signal Process Lett 25(10):1440–1444. https://doi.org/10.1109/LSP.2018.2860246

  8. Cowie R (2009) Perceiving emotion: towards a realistic understanding of the task. Philos Trans R Soc Lond Ser B Biol Sci 364:3515–3525. https://doi.org/10.1098/rstb.2009.0139

  9. Cowie R, Douglas-Cowie E, Tsapatsoulis N, Votsis G, Kollias S, Fellenz W, Taylor JG (2001) Emotion recognition in human–computer interaction. IEEE Signal Process Mag 18(1):32–80

  10. El Ayadi M, Kamel MS, Karray F (2011) Survey on speech emotion recognition: features, classification schemes, and databases. Pattern Recognit 44(3):572–587. https://doi.org/10.1016/j.patcog.2010.09.020

  11. El Ayadi MMH, Kamel MS, Karray F (2007) Speech emotion recognition using Gaussian mixture vector autoregressive models. In: 2007 IEEE international conference on acoustics, speech and signal processing (ICASSP), vol 4, pp IV-957–IV-960

  12. Giannopoulos P, Perikos I, Hatzilygeroudis I (2018) Deep learning approaches for facial emotion recognition: a case study on FER-2013. In: Advances in hybridization of intelligent methods. Springer, pp 1–16

  13. Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: 2018 IEEE/CVF conference on computer vision and pattern recognition, pp 7132–7141. https://doi.org/10.1109/CVPR.2018.00745

  14. Ingale AB, Chaudhari D (2012) Speech emotion recognition. Int J Soft Comput Eng (IJSCE) 2(1):235–238

  15. Jalal M, Loweimi E, Moore R, Hain T (2019) Learning temporal clusters using capsule routing for speech emotion recognition. In: Interspeech 2019, pp 1701–1705. https://doi.org/10.21437/Interspeech.2019-3068

  16. Jones C, Sutherland J (2008) Acoustic emotion recognition for affective computer gaming. In: Affect and emotion in human–computer interaction. Springer, pp 209–219

  17. Lee C, Narayanan S, Pieraccini R (2002) Classifying emotions in human-machine spoken dialogs. In: Proceedings of the IEEE international conference on multimedia and expo (ICME), vol 1, pp 737–740. https://doi.org/10.1109/ICME.2002.1035887

  18. Lee J, Tashev I (2015) High-level feature representation using recurrent neural network for speech emotion recognition. In: Interspeech 2015. ISCA

  19. Lim W, Jang D, Lee T (2016) Speech emotion recognition using convolutional and recurrent neural networks. In: 2016 Asia-Pacific signal and information processing association annual summit and conference (APSIPA), pp 1–4

  20. Livingstone SR, Russo FA (2018) The Ryerson audio-visual database of emotional speech and song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5):e0196391

  21. Mao X, Chen L, Fu L (2009) Multi-level speech emotion recognition based on HMM and ANN. In: 2009 WRI World congress on computer science and information engineering, vol 7, pp 225–229

  22. Meng H, Yan T, Yuan F, Wei H (2019) Speech emotion recognition from 3D log-mel spectrograms with deep learning network. IEEE Access 7:125868–125881. https://doi.org/10.1109/ACCESS.2019.2938007

  23. Nwe T, Foo S, De Silva L (2003) Speech emotion recognition using hidden Markov models. Speech Commun 41:603–623. https://doi.org/10.1016/S0167-6393(03)00099-2

  24. Osawa H, Orszulak J, Godfrey KM, Coughlin JF (2010) Maintaining learning motivation of older people by combining household appliance with a communication robot. In: 2010 IEEE/RSJ international conference on intelligent robots and systems, pp 5310–5316

  25. Petrushin V (1999) Emotion in speech: recognition and application to call centers. In: Proceedings of artificial neural networks in engineering, vol 710, p 22

  26. Ranganathan H, Chakraborty S, Panchanathan S (2016) Multimodal emotion recognition using deep learning architectures. In: 2016 IEEE winter conference on applications of computer vision (WACV), pp 1–9

  27. Ren M, Nie W, Liu A, Su Y (2019) Multi-modal correlated network for emotion recognition in speech. Vis Inform 3(3):150–155

  28. Rozgic V, Ananthakrishnan S, Saleem S, Kumar R, Vembu A, Prasad R (2012) Emotion recognition using acoustic and lexical features. In: 13th annual conference of the international speech communication association, INTERSPEECH 2012, vol 1

  29. Schuller B, Rigoll G, Lang M (2003) Hidden Markov model-based speech emotion recognition. In: 2003 international conference on multimedia and expo (ICME ’03), vol 1, pp I-401. https://doi.org/10.1109/ICME.2003.1220939

  30. Schuller B, Rigoll G, Lang M (2004) Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture. In: 2004 IEEE international conference on acoustics, speech, and signal processing, vol 1, pp I–577

  31. Schuller B, Vlasenko B, Eyben F, Rigoll G, Wendemuth A (2009) Acoustic emotion recognition: a benchmark comparison of performances. In: 2009 IEEE workshop on automatic speech recognition and understanding, pp 552–557

  32. Song P, Jin Y, Zha C, Zhao L (2015) Speech emotion recognition method based on hidden factor analysis. Electron Lett 51(1):112–114

  33. Sun S, Pang J, Shi J, Yi S, Ouyang W (2019) FishNet: a versatile backbone for image, region, and pixel level prediction. arXiv preprint arXiv:1901.03495

  34. Tokuno S, Tsumatori G, Shono S, Takei E, Yamamoto T, Suzuki G, Shimura M (2011) Usage of emotion recognition in military health care. In: 2011 defense science research conference and expo (DSR), pp 1–5

  35. Zeng H, Wu Z, Zhang J, Yang C, Zhang H, Dai G, Kong W (2019) EEG emotion classification using an improved SincNet-based deep learning model. Brain Sci. https://doi.org/10.3390/brainsci9110326

  36. Zhang Q, Chen X, Zhan Q, Yang T, Xia S (2017) Respiration-based emotion recognition with deep learning. Comput Ind 92:84–90

  37. Zhao Z, Zheng Y, Zhang Z, Wang H, Zhao Y, Li C (2018) Exploring spatio-temporal representations by integrating attention-based bidirectional-LSTM-RNNs and FCNs for speech emotion recognition. In: Interspeech 2018

  38. Zheng WQ, Yu JS, Zou YX (2015) An experimental study of speech emotion recognition based on deep convolutional neural networks. In: 2015 international conference on affective computing and intelligent interaction (ACII), pp 827–831. https://doi.org/10.1109/ACII.2015.7344669

Funding

Not applicable.

Author information

Authors and Affiliations

Authors

Contributions

DPG and KM implemented the algorithm and wrote the paper. NDN, NS, and CPL provided guidance and revised the paper.

Corresponding author

Correspondence to Natesan Srinivasan.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Consent for publication

Yes.

Consent to participate

Yes.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Goel, D.P., Mahajan, K., Nguyen, N.D. et al. Towards an efficient backbone for preserving features in speech emotion recognition: deep-shallow convolution with recurrent neural network. Neural Comput & Applic 35, 2457–2469 (2023). https://doi.org/10.1007/s00521-022-07723-2
