Abstract
Speech Emotion Recognition (SER) is the process of recognizing emotions by extracting relevant features from speech signals. It is becoming very popular in Human-Computer Interaction (HCI) applications. The challenge is to extract the speech features relevant to emotion recognition at a low computational cost. In this paper, a lightweight Convolutional Neural Network (LCNN) based model is proposed that extracts useful features automatically. The speech samples are converted into 224 × 224 spectrograms for LCNN input. Five convolutional layers with strided convolutions are used to down-sample the feature maps in place of pooling layers, which reduces the computational cost. The model has been evaluated for accuracy on the publicly available benchmark datasets EMOVO (81%), EMODB (87%), and SAVEE (80%). Its accuracy is also found to be better than that of the SER CNN-assisted model and the ResNet-18 and ResNet-34 models. Very few speech datasets are available in an Indian accent, so the authors have created a new Indian Emotional Speech Corpora (IESC) in the English language, with 600 speech samples recorded from 8 speakers using 2 sentences in 5 emotions. It will be made publicly available to researchers. The accuracy of the proposed LCNN model on IESC is found to be 95%, which is higher than its accuracy on the existing datasets.
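The abstract's down-sampling scheme (strided convolutions replacing pooling layers) can be illustrated with the standard convolution output-size formula. The sketch below is a minimal illustration, not the authors' implementation: the kernel size 3, stride 2, and padding 1 are assumed values, since the abstract specifies only the 224 × 224 input and the five layers.

```python
def conv_out_size(size, kernel=3, stride=2, padding=1):
    # Standard convolution output-size formula:
    # out = floor((size + 2*padding - kernel) / stride) + 1
    # A stride of 2 halves the spatial size, so no pooling layer is needed.
    return (size + 2 * padding - kernel) // stride + 1

def feature_map_sizes(input_size=224, num_layers=5):
    # Trace the spatial size of the feature map through the strided layers.
    sizes = [input_size]
    for _ in range(num_layers):
        sizes.append(conv_out_size(sizes[-1]))
    return sizes

print(feature_map_sizes())  # [224, 112, 56, 28, 14, 7]
```

Under these assumed hyperparameters, each strided layer halves the spatial resolution, so the 224 × 224 spectrogram shrinks to 7 × 7 after five layers without any pooling operations.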
Abbreviations
- ANN: Artificial Neural Network
- CNN: Convolutional Neural Network
- GMM: Gaussian Mixture Model
- HMM: Hidden Markov Model
- IESC: Indian Emotional Speech Corpora
- LSTM: Long Short-Term Memory
- MFCC: Mel-Frequency Cepstral Coefficient
- MFMC: Mel Frequency Magnitude Coefficient
- RNN: Recurrent Neural Network
- SER: Speech Emotion Recognition
References
Bansal S, Dev A (2013) Emotional Hindi speech database. In: 2013 international conference oriental COCOSDA held jointly with the 2013 conference on Asian spoken language research and evaluation (O-COCOSDA/CASLRE) (2013), pp 1-4. IEEE. https://doi.org/10.1109/ICSDA.2013.6709867
Burkhardt F, Paeschke A, Rolfes M, Sendlmeier W, Weiss B (2005) A database of German emotional speech. In: Ninth European Conference on Speech Communication and Technology, Lisbon, Portugal, September 4-8, 2005, pp 1517-1520. https://doi.org/10.21437/Interspeech.2005-446
Gu C et al (2018) AVA: a video dataset of spatio-temporally localized atomic visual actions. In: IEEE Conference on Computer Vision and Pattern Recognition. https://arxiv.org/abs/1705.08421
Costantini G, Iaderola I, Paoloni A, Todisco M (2014) EMOVO corpus: an Italian emotional speech database. In: International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, 26-31 May 2014, European Language Resources Association (ELRA), Paris, 2014, pp 3501–3504. http://www.lrec-conf.org/proceedings/lrec2014/pdf/591_Paper.pdf
Dai K, Fell HJ, Mac Auslan J (2008) Recognizing emotion in speech using neural networks. In: 4th IASTED international conference on telehealth and assistive technologies, pp 31-36. https://doi.org/10.5555/1722763
Daneshfar F, Kabudian SJ (2020) Speech emotion recognition using discriminative dimension reduction by employing a modified quantum behaved particle swarm optimization algorithm. Multimed Tools Appl 79(1):1261–1289. https://doi.org/10.1007/s11042-019-08222-8
Deng J, Xu X, Zhang Z, Fruhholz S, Schuller B (2017) Universum autoencoder-based domain adaptation for speech emotion recognition. IEEE Signal Process Lett 24:500–504. https://doi.org/10.1109/lsp.2017.2672753
El Ayadi M, Kamel M, Karray F (2011) Survey on speech emotion recognition: features, classification schemes, and databases. Pattern Recogn 44:572–587. https://doi.org/10.1016/j.patcog.2010.09.020
Firoz SA, Raji SA, Babu AP (2009) Automatic emotion recognition from speech using artificial neural networks with gender-dependent databases. In: IEEE international conference on advances in computing, control, and telecommunication technologies. 28–29 Dec 2009 Bangalore, India. https://doi.org/10.1109/ACT.2009.49
Han K, Yu D, Tashev I (2014) Speech emotion recognition using deep neural network and extreme learning machine. In: Fifteenth Annual Conference of the International Speech Communication Association, Singapore, pp 223–227. https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/IS140441.pdf
Haq S, Jackson PJ (2011) Multimodal emotion recognition. In: Machine audition: principles, algorithms and systems, pp 398-423. IGI global. https://doi.org/10.4018/978-1-61520-919-4.ch017
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: IEEE conference on computer vision and pattern recognition, pp 770–778. https://doi.org/10.1109/CVPR.2016.90
IESC: www.kaggle.com/dataset/60f09eaaea16bf15f44d4ada0b10b62f64d6296262b8f2d879572fbb1e5ea51f
Issa D, Demirci MF, Yazici A (2020) Speech emotion recognition with deep convolutional neural networks. Biomed Signal Process Control 59:101894. https://doi.org/10.1016/j.bspc.2020.101894
Jahangir R, Teh YW, Hanif F, Mujtaba G (2021) Deep learning approaches for speech emotion recognition: state of the art and research challenges. Multimed Tools Appl 80:23745–23812. https://doi.org/10.1007/s11042-020-09874-7
Kaya H, Karpov AA (2018) Efficient and effective strategies for cross-corpus acoustic emotion recognition. Neurocomputing 275:1028–1034. https://doi.org/10.1016/j.neucom.2017.09.049
Khanchandani KB, Hussain MA (2009) Emotion recognition using multilayer perceptron and generalized feed forward neural network, CSIR 68:367–371. http://hdl.handle.net/123456789/3787
Kim Y, Lee H, Provost EM (2013) Deep learning for robust feature generation in audio visual emotion recognition. In IEEE international conference on acoustics, speech and signal processing, Vancouver, BC, pp 3687-3691. https://doi.org/10.1109/ICASSP.2013.6638346
Koolagudi S, Rao K (2012) Emotion recognition from speech: a review. Int J Speech Technol 15:99–117. https://doi.org/10.1007/s10772-011-9125-1
Koolagudi S, Maity S, Kumar V, Chakrabarti S, Rao K (2009) IITKGP-SESC: speech database for emotion analysis. In: 3rd International Conference on Contemporary Computing, 17–19 August Noida, India, pp 485–492. Communications in Computer and Information Science, volume 40. ISBN 978–3–642-03546-3. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-03547-0_46
Koolagudi S, Murthy Y, Bhaskar S (2018) Choice of a classifier, based on properties of a dataset: case study-speech emotion recognition. Int J Speech Technol 21:167–183. https://doi.org/10.1007/s10772-018-9495-8
Kwon S (2020) A CNN-assisted enhanced audio signal processing for speech emotion recognition. Sensors 20(1):183. https://doi.org/10.3390/s20010183
Lakomkin E, Zamani MA, Weber C, Magg S, Wermter S (2018) On the robustness of speech emotion recognition for human-robot interaction with deep neural networks. 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS), pp 854–860. https://doi.org/10.1109/IROS.2018.8593571
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444. https://doi.org/10.1038/nature14539
Lee MC, Chiang SY, Yeh SC, Wen TF (2020) Study on emotion recognition and companion Chatbot using deep neural network. Multimed Tools Appl 79:19629–19657. https://doi.org/10.1007/s11042-020-08841-6
Li S, Xing X, Fan W, Cai B, Fordson P (2021) Spatiotemporal and frequential cascaded attention networks for speech emotion recognition. Neurocomputing 238-248. https://doi.org/10.1016/j.neucom.2021.02.094
Liu Z-T, Wu M, Cao WH, Mao JW, Xu JP, Tan GZ (2018) Speech emotion recognition based on feature selection and extreme learning machine decision tree. Neurocomputing 273:271–280. https://doi.org/10.1016/j.neucom.2017.07.050
Livingstone S, Russo F (2018) The Ryerson audio-visual database of emotional speech and song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in north American English. PLoS ONE 13:e0196391. https://doi.org/10.1371/journal.pone.0196391
Martin O, Kotsia I, Macq B, Pitas I (2006) The eNTERFACE'05 audio-visual emotion database. In: 22nd International Conference on Data Engineering Workshops (ICDEW'06), Atlanta, GA, USA, p 8. https://doi.org/10.1109/ICDEW.2006.145
Mo S, Niu J, Yiming S, Sajal Das K (2018) A novel feature set for video emotion recognition. Neurocomputing 291:11–20. https://doi.org/10.1016/j.neucom.2018.02.052
Nakatsu R, Nicholson J, Tosa N (2000) Emotion recognition and its application to computer agents with spontaneous interactive capabilities. Knowl-Based Syst 13:497–504. https://doi.org/10.1016/s0950-7051(00)00070-8
Niu J, Qian Y, Yu K (2014) Acoustic emotion recognition using deep neural network. In: IEEE 9th international symposium Chinese spoken languages and process (ISCSLP), pp 128-132. https://doi.org/10.1109/ISCSLP.2014.6936657
Özseven T (2018) Investigation of the effect of spectrogram images and different texture analysis methods on speech emotion recognition. Appl Acoust 142:70–77. https://doi.org/10.1016/j.apacoust.2018.08.003
Özseven T (2019) A novel feature selection method for speech emotion recognition. Appl Acoust 146:320–326. https://doi.org/10.1016/j.apacoust.2018.11.028
Partila P, Voznak M (2013) Speech emotions recognition using 2-d neural classifier. In: Nostradamus: Prediction, modeling and analysis of complex systems. Springer, Heidelberg, pp 221–231. https://doi.org/10.1007/978-3-319-00542-3_23
Polzehl T, Sundaram S, Ketabdar H, Wagner M, Metze F (2009) Emotion classification in children's speech using fusion of acoustic and linguistic features. In: Tenth Annual Conference of the International Speech Communication Association, Brighton, United Kingdom, September 6-10, 2009, pp 340–343. https://researchsystem.canberra.edu.au/ws/portalfiles/portal/29337473/fulltext_published.pdf
Polzin T, Waibel A (1998) Detecting emotions in speech. In: Proceedings of the Cooperative Multimodal Communication, Second International Conference, CMC'98, Tilburg, The Netherlands, January 28-30, 1998
Savargiv M, Bastanfard A (2014) Study on unit-selection and statistical parametric speech synthesis techniques. J Comput Robot 7(1):19–25. http://www.qjcr.ir/article_649_5c6e6b9b8ff146dac392223000b491db.pdf
Savargiv M, Bastanfard A (2015) Persian speech emotion recognition. In: 7th conference on information and knowledge technology (IKT) 1-5. https://doi.org/10.1109/IKT.2015.7288756
Savargiv M, Bastanfard A (2016) Real-time speech emotion recognition by minimum number of features. Artif Intell Robot (IRANOPEN):72–76. https://doi.org/10.1109/RIOS.2016.7529493
Sharma R, Pachori RB, Sircar P (2020) Automated emotion recognition based on higher order statistics and deep learning algorithm. Biomed Signal Process Control 58:101867. https://doi.org/10.1016/j.bspc.2020.101867
Singh YB, Goel S (2018) Survey on human emotion recognition: speech database, features, and classification. In: International conference on advances in computing, communication control and networking (ICACCCN), India, pp 298-301. https://doi.org/10.1109/ICACCCN.2018.8748379
Singh YB, Goel S (2021) An efficient algorithm for recognition of emotions from speaker and language independent speech using deep learning. Multimed Tools Appl 80:14001–14018. https://doi.org/10.1007/s11042-020-10399-2
Stuhlsatz A, Meyer C, Eyben F, Zielke T, Meier G, Schuller B (2011) Deep neural networks for acoustic emotion recognition: raising the benchmarks. IEEE international conference on acoustics, speech and signal processing (ICASSP), Prague, pp 5688-5691. https://doi.org/10.1109/ICASSP.2011.5947651
Tang H, Chu SM, Hasegawa-Johnson M, Huang TS (2009) Emotion recognition from speech VIA boosted Gaussian mixture models. IEEE international conference on multimedia and expo, New York, NY, pp 294–297. https://doi.org/10.1109/ICME.2009.5202493
Tang D, Zeng J, Li M (2018) An end-to-end deep learning framework with speech emotion recognition of atypical individuals. In: INTERSPEECH 2018, pp 162-166. https://doi.org/10.21437/Interspeech.2018-2581
ten Bosch L (2003) Emotions, speech and the ASR framework. Speech Comm 40:213–225. https://doi.org/10.1016/s0167-6393(02)00083-3
Trigeorgis G, Ringeval F, Brueckner R, Marchi E, Nicolaou MA, Schuller B, Zafeiriou S (2016) Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), Shanghai, pp 5200-5204. https://doi.org/10.1109/ICASSP.2016.7472669
Ververidis D, Kotropoulos C (2006) Emotional speech recognition: resources, features, and methods. Speech Comm 48:1162–1181. https://doi.org/10.1016/j.specom.2006.04.003
Wang L (ed) (2005) Support vector machines: theory and applications. Springer-Verlag, Berlin Heidelberg. https://doi.org/10.1002/9781118197448
Wang K, An N, Li BN, Zhang Y, Li L (2015) Speech emotion recognition using Fourier parameters. IEEE Trans Affect Comput 6:69–75. https://doi.org/10.1109/TAFFC.2015.2392101
Womack B, Hansen J (1999) N-channel hidden Markov models for combined stressed speech classification and recognition. IEEE Trans Speech Audio Process 7:668–677. https://doi.org/10.1109/89.799692
Wu L, Hong R, Wang Y, Wang M (2019) Cross-entropy adversarial view adaptation for person re-identification. IEEE Trans Circuits Syst Video Technol 30(7):2081–2092. https://doi.org/10.1109/TCSVT.2019.2909549
Yang W (2018) Survey on deep multi-modal data analytics: collaboration, rivalry and fusion. J ACM 37(4) article 111 26 pages. https://doi.org/10.1145/1122445.1122456
Zayene B, Jlassi C, Arous N (2020) 3D convolutional recurrent global neural network for speech emotion recognition. 5th IEEE international conference on advanced technologies for signal and image processing (ATSIP), pp 1-5. https://doi.org/10.1109/ATSIP49331.2020.9231597
Zhang Y, Liu Y, Weninger F, Schuller B (2017) Multi-task deep neural network with shared hidden layers: breaking down the wall between emotion representations. 2017 IEEE international conference on acoustics, Speech and Signal Processing (ICASSP), 2017, pp 4990–4994. https://doi.org/10.1109/ICASSP.2017.7953106
Zhang W, Zhao D, Chai Z, Yang LT, Liu X, Gong F, Yang S (2017) Deep learning and SVM-based emotion recognition from Chinese speech for smart affective services. Software Pract Exp 47(8):1127–1138. https://doi.org/10.1002/spe.2487
Zhao J, Mao X, Chen L (2019) Speech emotion recognition using deep 1D & 2D CNN LSTM networks. Biomed Signal Process Control 47:312–323. https://doi.org/10.1016/j.bspc.2018.08.035
Zheng W, Xin M, Wang X, Wang B (2014) A novel speech emotion recognition method via incomplete sparse least square regression. IEEE Signal Process Lett 21:569–572. https://doi.org/10.1109/lsp.2014.2308954
Zhou J, Wang G, Yang Y, Chen P (2006) Speech emotion recognition based on rough set and SVM. In: 5th IEEE International Conference on Cognitive Informatics, Beijing, pp 53-61. https://doi.org/10.1109/COGINF.2006.365676
Zvarevashe K, Olugbara O (2020) Ensemble learning of hybrid acoustic features for speech emotion recognition. Algorithms 13(3):70. https://doi.org/10.3390/a13030070
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have influenced the work reported in this paper.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Singh, Y.B., Goel, S. A lightweight 2D CNN based approach for speaker-independent emotion recognition from speech with new Indian Emotional Speech Corpora. Multimed Tools Appl 82, 23055–23073 (2023). https://doi.org/10.1007/s11042-023-14577-w