
A lightweight 2D CNN based approach for speaker-independent emotion recognition from speech with new Indian Emotional Speech Corpora

Published in: Multimedia Tools and Applications

Abstract

Speech Emotion Recognition (SER) is the process of recognizing emotions by extracting relevant features from speech signals, and it is becoming increasingly popular in Human-Computer Interaction (HCI) applications. The challenge is to extract features that discriminate emotions well at a low computational cost. In this paper, a lightweight Convolutional Neural Network (LCNN) based model is proposed that extracts useful features automatically. Speech samples are converted into spectrograms of size 224 × 224 for LCNN input. The network uses 5 CNN layers, with strided convolutions down-sampling the feature maps in place of pooling layers, which reduces the computational cost. The model has been evaluated on the publicly available benchmark datasets EMOVO (81%), EMODB (87%), and SAVEE (80%), and its accuracy is also found to be better than the SER CNN-assisted model, ResNet-18, and ResNet-34. Because very few speech datasets with Indian accents are available, the authors have created a new Indian Emotional Speech Corpora (IESC) in the English language, comprising 600 speech samples recorded from 8 speakers using 2 sentences in 5 emotions; it will be made publicly available for researchers. The accuracy of the proposed LCNN model on IESC is 95%, which is better than its accuracy on the existing datasets.
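The abstract's key architectural choice, replacing pooling layers with strided convolutions for down-sampling, can be illustrated with the standard convolution output-size formula. The sketch below is hypothetical (the paper does not state kernel size, stride, or padding in the abstract); assuming a common kernel 3, stride 2, padding 1 configuration, each of the 5 layers halves the spatial resolution of the 224 × 224 spectrogram:

```python
# Hypothetical sketch, not the authors' exact configuration: trace how five
# strided convolutional layers (kernel=3, stride=2, padding=1 assumed) shrink
# a 224x224 spectrogram without any pooling layers.

def conv_out_size(size, kernel=3, stride=2, padding=1):
    """Standard convolution output-size formula: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

def feature_map_sizes(input_size=224, num_layers=5):
    """Spatial size of the feature map after each strided layer."""
    sizes = [input_size]
    for _ in range(num_layers):
        sizes.append(conv_out_size(sizes[-1]))
    return sizes

print(feature_map_sizes())  # [224, 112, 56, 28, 14, 7]
```

Under these assumed hyperparameters, the stride-2 convolution performs the same spatial reduction a 2 × 2 pooling layer would, but learns its down-sampling weights and avoids a separate pooling pass, which is the computational saving the abstract refers to.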


Abbreviations

ANN: Artificial Neural Network
CNN: Convolutional Neural Network
GMM: Gaussian Mixture Model
HMM: Hidden Markov Model
IESC: Indian Emotional Speech Corpora
LSTM: Long Short-Term Memory
MFCC: Mel-Frequency Cepstral Coefficient
MFMC: Mel Frequency Magnitude Coefficient
RNN: Recurrent Neural Network
SER: Speech Emotion Recognition


Author information


Corresponding author

Correspondence to Youddha Beer Singh.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have influenced the work presented in this paper.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Singh, Y.B., Goel, S. A lightweight 2D CNN based approach for speaker-independent emotion recognition from speech with new Indian Emotional Speech Corpora. Multimed Tools Appl 82, 23055–23073 (2023). https://doi.org/10.1007/s11042-023-14577-w

