Abstract
Speech Emotion Recognition (SER) is the process of recognizing emotions by extracting relevant features from speech signals. It is becoming very popular in Human-Computer Interaction (HCI) applications. The challenge is to extract the speech features relevant to emotion recognition at a low computational cost. In this paper, a lightweight Convolutional Neural Network (LCNN) based model is proposed that extracts useful features automatically. The speech samples are converted into 224 × 224 spectrograms for LCNN input. Five convolutional layers with strided convolutions are used to down-sample the feature maps in place of pooling layers, which reduces the computational cost. The model has been evaluated for accuracy on the publicly available benchmark datasets EMOVO (81%), EMODB (87%), and SAVEE (80%). Its accuracy is also found to be better than that of the SER CNN-assisted model and the ResNet-18 and ResNet-34 models. Very few speech datasets are available in an Indian accent, so the authors have created a new Indian Emotional Speech Corpora (IESC) in the English language, with 600 speech samples recorded from 8 speakers using 2 sentences in 5 emotions. It will be made publicly available to researchers. The accuracy of the proposed LCNN model on IESC is found to be 95%, which is higher than its accuracy on the existing datasets.
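The abstract's down-sampling scheme (strided convolutions replacing pooling layers) can be illustrated with the standard convolution output-size formula. The sketch below is a minimal illustration, not the authors' implementation: the kernel size 3, stride 2, and padding 1 are assumed values, since the abstract specifies only the 224 × 224 input and the five layers.

```python
def conv_out_size(size, kernel=3, stride=2, padding=1):
    # Standard convolution output-size formula:
    # out = floor((size + 2*padding - kernel) / stride) + 1
    # A stride of 2 halves the spatial size, so no pooling layer is needed.
    return (size + 2 * padding - kernel) // stride + 1

def feature_map_sizes(input_size=224, num_layers=5):
    # Trace the spatial size of the feature map through the strided layers.
    sizes = [input_size]
    for _ in range(num_layers):
        sizes.append(conv_out_size(sizes[-1]))
    return sizes

print(feature_map_sizes())  # [224, 112, 56, 28, 14, 7]
```

Under these assumed hyperparameters, each strided layer halves the spatial resolution, so the 224 × 224 spectrogram shrinks to 7 × 7 after five layers without any pooling operations.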
Abbreviations
- ANN: Artificial Neural Network
- CNN: Convolutional Neural Network
- GMM: Gaussian Mixture Model
- HMM: Hidden Markov Model
- IESC: Indian Emotional Speech Corpora
- LSTM: Long Short-Term Memory
- MFCC: Mel-Frequency Cepstral Coefficient
- MFMC: Mel Frequency Magnitude Coefficient
- RNN: Recurrent Neural Network
- SER: Speech Emotion Recognition
References
Bansal S, Dev A (2013) Emotional Hindi speech database. In: 2013 international conference oriental COCOSDA held jointly with the 2013 conference on Asian spoken language research and evaluation (O-COCOSDA/CASLRE) (2013), pp 1-4. IEEE. https://doi.org/10.1109/ICSDA.2013.6709867
Burkhardt F, Paeschke A, Rolfes M, Sendlmeier W, Weiss B (2005) A database of German emotional speech. In: Ninth European Conference on Speech Communication and Technology, Lisbon, Portugal, September 4-8, 2005, pp 1517-1520. https://doi.org/10.21437/Interspeech.2005-446
Gu C et al (2018) AVA: a video dataset of spatio-temporally localized atomic visual actions. In: IEEE Conference on Computer Vision and Pattern Recognition. https://arxiv.org/abs/1705.08421
Costantini G, Iaderola I, Paoloni A, Todisco M (2014) EMOVO corpus: an Italian emotional speech database. In: International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, 26-31 May 2014, European Language Resources Association (ELRA), Paris, 2014, pp 3501–3504. http://www.lrec-conf.org/proceedings/lrec2014/pdf/591_Paper.pdf
Dai K, Fell HJ, Mac Auslan J (2008) Recognizing emotion in speech using neural networks. In: 4th IASTED international conference on telehealth and assistive technologies, pp 31-36. https://doi.org/10.5555/1722763
Daneshfar F, Kabudian SJ (2020) Speech emotion recognition using discriminative dimension reduction by employing a modified quantum behaved particle swarm optimization algorithm. Multimed Tools Appl 79(1):1261–1289. https://doi.org/10.1007/s11042-019-08222-8
Deng J, Xu X, Zhang Z, Fruhholz S, Schuller B (2017) Universum autoencoder-based domain adaptation for speech emotion recognition. IEEE Signal Process Lett 24:500–504. https://doi.org/10.1109/lsp.2017.2672753
El Ayadi M, Kamel M, Karray F (2011) Survey on speech emotion recognition: features, classification schemes, and databases. Pattern Recogn 44:572–587. https://doi.org/10.1016/j.patcog.2010.09.020
Firoz SA, Raji SA, Babu AP (2009) Automatic emotion recognition from speech using artificial neural networks with gender-dependent databases. In: IEEE international conference on advances in computing, control, and telecommunication technologies. 28–29 Dec 2009 Bangalore, India. https://doi.org/10.1109/ACT.2009.49
Han K, Yu D, Tashev I (2014) Speech emotion recognition using deep neural network and extreme learning machine. In: Fifteenth Annual Conference of the International Speech Communication Association, Singapore, pp 223–227. https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/IS140441.pdf
Haq S, Jackson PJ (2011) Multimodal emotion recognition. In: Machine audition: principles, algorithms and systems, pp 398-423. IGI global. https://doi.org/10.4018/978-1-61520-919-4.ch017
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: IEEE conference on computer vision and pattern recognition, pp 770–778. https://doi.org/10.1109/CVPR.2016.90
IESC: www.kaggle.com/dataset/60f09eaaea16bf15f44d4ada0b10b62f64d6296262b8f2d879572fbb1e5ea51f
Issa D, Demirci MF, Yazici A (2020) Speech emotion recognition with deep convolutional neural networks. Biomed Signal Process Control 59:101894. https://doi.org/10.1016/j.bspc.2020.101894
Jahangir R, Teh YW, Hanif F, Mujtaba G (2021) Deep learning approaches for speech emotion recognition: state of the art and research challenges. Multimed Tools Appl 80:23745–23812. https://doi.org/10.1007/s11042-020-09874-7
Kaya H, Karpov AA (2018) Efficient and effective strategies for cross-corpus acoustic emotion recognition. Neurocomputing 275:1028–1034. https://doi.org/10.1016/j.neucom.2017.09.049
Khanchandani KB, Hussain MA (2009) Emotion recognition using multilayer perceptron and generalized feed forward neural network, CSIR 68:367–371. http://hdl.handle.net/123456789/3787
Kim Y, Lee H, Provost EM (2013) Deep learning for robust feature generation in audio visual emotion recognition. In IEEE international conference on acoustics, speech and signal processing, Vancouver, BC, pp 3687-3691. https://doi.org/10.1109/ICASSP.2013.6638346
Koolagudi S, Rao K (2012) Emotion recognition from speech: a review. Int J Speech Technol 15:99–117. https://doi.org/10.1007/s10772-011-9125-1
Koolagudi S, Maity S, Kumar V, Chakrabarti S, Rao K (2009) IITKGP-SESC: speech database for emotion analysis. In: 3rd International Conference on Contemporary Computing, 17–19 August Noida, India, pp 485–492. Communications in Computer and Information Science, volume 40. ISBN 978–3–642-03546-3. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-03547-0_46
Koolagudi S, Murthy Y, Bhaskar S (2018) Choice of a classifier, based on properties of a dataset: case study-speech emotion recognition. Int J Speech Technol 21:167–183. https://doi.org/10.1007/s10772-018-9495-8
Kwon S (2020) A CNN-assisted enhanced audio signal processing for speech emotion recognition. Sensors 20(1):183. https://doi.org/10.3390/s20010183
Lakomkin E, Zamani MA, Weber C, Magg S, Wermter S (2018) On the robustness of speech emotion recognition for human-robot interaction with deep neural networks. 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS), pp 854–860. https://doi.org/10.1109/IROS.2018.8593571
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444. https://doi.org/10.1038/nature14539
Lee MC, Chiang SY, Yeh SC, Wen TF (2020) Study on emotion recognition and companion Chatbot using deep neural network. Multimed Tools Appl 79:19629–19657. https://doi.org/10.1007/s11042-020-08841-6
Li S, Xing X, Fan W, Cai B, Fordson P (2021) Spatiotemporal and frequential cascaded attention networks for speech emotion recognition. Neurocomputing 238-248. https://doi.org/10.1016/j.neucom.2021.02.094
Liu Z-T, Wu M, Cao WH, Mao JW, Xu JP, Tan GZ (2018) Speech emotion recognition based on feature selection and extreme learning machine decision tree. Neurocomputing 273:271–280. https://doi.org/10.1016/j.neucom.2017.07.050
Livingstone S, Russo F (2018) The Ryerson audio-visual database of emotional speech and song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in north American English. PLoS ONE 13:e0196391. https://doi.org/10.1371/journal.pone.0196391
Martin O, Kotsia I, Macq B, Pitas I (2006) The eNTERFACE'05 audio-visual emotion database. In: 22nd International Conference on Data Engineering Workshops (ICDEW'06), Atlanta, GA, USA, p 8. https://doi.org/10.1109/ICDEW.2006.145
Mo S, Niu J, Yiming S, Sajal Das K (2018) A novel feature set for video emotion recognition. Neurocomputing 291:11–20. https://doi.org/10.1016/j.neucom.2018.02.052
Nakatsu R, Nicholson J, Tosa N (2000) Emotion recognition and its application to computer agents with spontaneous interactive capabilities. Knowl-Based Syst 13:497–504. https://doi.org/10.1016/s0950-7051(00)00070-8
Niu J, Qian Y, Yu K (2014) Acoustic emotion recognition using deep neural network. In: IEEE 9th international symposium Chinese spoken languages and process (ISCSLP), pp 128-132. https://doi.org/10.1109/ISCSLP.2014.6936657
Özseven T (2018) Investigation of the effect of spectrogram images and different texture analysis methods on speech emotion recognition. Appl Acoust 142:70–77. https://doi.org/10.1016/j.apacoust.2018.08.003
Özseven T (2019) A novel feature selection method for speech emotion recognition. Appl Acoust 146:320–326. https://doi.org/10.1016/j.apacoust.2018.11.028
Partila P, Voznak M (2013) Speech emotions recognition using 2-d neural classifier. In: Nostradamus: Prediction, modeling and analysis of complex systems. Springer, Heidelberg, pp 221–231. https://doi.org/10.1007/978-3-319-00542-3_23
Polzehl T, Sundaram S, Ketabdar H, Wagner M, Metze F (2009) Emotion classification in children's speech using fusion of acoustic and linguistic features. In: Tenth Annual Conference of the International Speech Communication Association, Brighton, United Kingdom, September 6-10, 2009, pp 340–343. https://researchsystem.canberra.edu.au/ws/portalfiles/portal/29337473/fulltext_published.pdf
Polzin T, Waibel A (1998) Detecting emotions in speech. In: Proceedings of the Cooperative Multimodal Communication, Second International Conference, CMC'98, Tilburg, The Netherlands, January 28-30, 1998
Savargiv M, Bastanfard A (2014) Study on unit-selection and statistical parametric speech synthesis techniques. J Comput Robot 7(1):19–25. http://www.qjcr.ir/article_649_5c6e6b9b8ff146dac392223000b491db.pdf
Savargiv M, Bastanfard A (2015) Persian speech emotion recognition. In: 7th conference on information and knowledge technology (IKT) 1-5. https://doi.org/10.1109/IKT.2015.7288756
Savargiv M, Bastanfard A (2016) Real-time speech emotion recognition by minimum number of features. Artif Intell Robot (IRANOPEN):72–76. https://doi.org/10.1109/RIOS.2016.7529493
Sharma R, Pachori RB, Sircar P (2020) Automated emotion recognition based on higher order statistics and deep learning algorithm. Biomed Signal Process Control 58:101867. https://doi.org/10.1016/j.bspc.2020.101867
Singh YB, Goel S (2018) Survey on human emotion recognition: speech database, features, and classification. In: International conference on advances in computing, communication control and networking (ICACCCN), India, pp 298-301. https://doi.org/10.1109/ICACCCN.2018.8748379
Singh YB, Goel S (2021) An efficient algorithm for recognition of emotions from speaker and language independent speech using deep learning. Multimed Tools Appl 80:14001–14018. https://doi.org/10.1007/s11042-020-10399-2
Stuhlsatz A, Meyer C, Eyben F, Zielke T, Meier G, Schuller B (2011) Deep neural networks for acoustic emotion recognition: raising the benchmarks. IEEE international conference on acoustics, speech and signal processing (ICASSP), Prague, pp 5688-5691. https://doi.org/10.1109/ICASSP.2011.5947651
Tang H, Chu SM, Hasegawa-Johnson M, Huang TS (2009) Emotion recognition from speech VIA boosted Gaussian mixture models. IEEE international conference on multimedia and expo, New York, NY, pp 294–297. https://doi.org/10.1109/ICME.2009.5202493
Tang D, Zeng J, Li M (2018) An end-to-end deep learning framework with speech emotion recognition of atypical individuals. In: INTERSPEECH 2018, pp 162-166. https://doi.org/10.21437/Interspeech.2018-2581
ten Bosch L (2003) Emotions, speech and the ASR framework. Speech Comm 40:213–225. https://doi.org/10.1016/s0167-6393(02)00083-3
Trigeorgis G, Ringeval F, Brueckner R, Marchi E, Nicolaou MA, Schuller B, Zafeiriou S (2016) Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), Shanghai, pp 5200-5204. https://doi.org/10.1109/ICASSP.2016.7472669
Ververidis D, Kotropoulos C (2006) Emotional speech recognition: resources, features, and methods. Speech Comm 48:1162–1181. https://doi.org/10.1016/j.specom.2006.04.003
Wang L (ed) (2005) Support vector machines: theory and applications. Springer-Verlag, Berlin Heidelberg. https://doi.org/10.1002/9781118197448
Wang K, An N, Li BN, Zhang Y, Li L (2015) Speech emotion recognition using Fourier parameters. IEEE Trans Affect Comput 6:69–75. https://doi.org/10.1109/TAFFC.2015.2392101
Womack B, Hansen J (1999) N-channel hidden Markov models for combined stressed speech classification and recognition. IEEE Trans Speech Audio Process 7:668–677. https://doi.org/10.1109/89.799692
Wu L, Hong R, Wang Y, Wang M (2019) Cross-entropy adversarial view adaptation for person re-identification. IEEE Trans Circuits Syst Video Technol 30(7):2081–2092. https://doi.org/10.1109/TCSVT.2019.2909549
Yang W (2018) Survey on deep multi-modal data analytics: collaboration, rivalry and fusion. J ACM 37(4) article 111 26 pages. https://doi.org/10.1145/1122445.1122456
Zayene B, Jlassi C, Arous N (2020) 3D convolutional recurrent global neural network for speech emotion recognition. 5th IEEE international conference on advanced technologies for signal and image processing (ATSIP), pp 1-5. https://doi.org/10.1109/ATSIP49331.2020.9231597
Zhang Y, Liu Y, Weninger F, Schuller B (2017) Multi-task deep neural network with shared hidden layers: breaking down the wall between emotion representations. 2017 IEEE international conference on acoustics, Speech and Signal Processing (ICASSP), 2017, pp 4990–4994. https://doi.org/10.1109/ICASSP.2017.7953106
Zhang W, Zhao D, Chai Z, Yang LT, Liu X, Gong F, Yang S (2017) Deep learning and SVM-based emotion recognition from Chinese speech for smart affective services. Software Pract Exp 47(8):1127–1138. https://doi.org/10.1002/spe.2487
Zhao J, Mao X, Chen L (2019) Speech emotion recognition using deep 1D & 2D CNN LSTM networks. Biomed Signal Process Control 47:312–323. https://doi.org/10.1016/j.bspc.2018.08.035
Zheng W, Xin M, Wang X, Wang B (2014) A novel speech emotion recognition method via incomplete sparse least square regression. IEEE Signal Process Lett 21:569–572. https://doi.org/10.1109/lsp.2014.2308954
Zhou J, Wang G, Yang Y, Chen P (2006) Speech emotion recognition based on rough set and SVM. In: 5th IEEE International Conference on Cognitive Informatics, Beijing, pp 53-61. https://doi.org/10.1109/COGINF.2006.365676
Zvarevashe K, Olugbara O (2020) Ensemble learning of hybrid acoustic features for speech emotion recognition. Algorithms 13(3):70. https://doi.org/10.3390/a13030070
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have influenced the work reported in this paper.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Singh, Y.B., Goel, S. A lightweight 2D CNN based approach for speaker-independent emotion recognition from speech with new Indian Emotional Speech Corpora. Multimed Tools Appl 82, 23055–23073 (2023). https://doi.org/10.1007/s11042-023-14577-w