Comparative Analysis of Windows for Speech Emotion Recognition Using CNN

  • Conference paper
Optimization, Learning Algorithms and Applications (OL2A 2023)

Abstract

This paper compares the accuracy obtained in the Speech Emotion Recognition task when the Hamming and the Hanning windows are used to frame the speech signal and compute the spectrogram fed as input to a convolutional neural network. The detection of 4 to 10 emotional states was tested with both windows. The results show significant differences in accuracy between the two window types and provide valuable insights for the development of more efficient emotional state detection systems. The best accuracies were 64.1% (4 emotions), 57.8% (5 emotions), 59.8% (6 emotions), 48.4% (7 emotions), 47.8% (8 emotions), 51.4% (9 emotions), and 45.9% (10 emotions). These accuracies are at the state-of-the-art level.
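
For readers unfamiliar with the windowing step described in the abstract, the sketch below (not the authors' implementation; the sampling rate, frame length, and overlap are illustrative assumptions) shows how a log-magnitude spectrogram can be computed with each window in SciPy, where the Hanning window goes by the name "hann":

```python
# Minimal sketch, assuming a 16 kHz sampling rate, 512-sample frames,
# and 50% overlap; none of these values come from the paper.
import numpy as np
from scipy import signal

fs = 16_000                      # assumed sampling rate (Hz)
x = np.random.randn(fs)          # stand-in for one second of speech

for win in ("hamming", "hann"):  # SciPy's name for the Hanning window is "hann"
    f, t, Sxx = signal.spectrogram(
        x,
        fs=fs,
        window=win,
        nperseg=512,             # assumed frame length (32 ms at 16 kHz)
        noverlap=256,            # assumed 50% frame overlap
    )
    # Log-magnitude spectrograms are a common image-like input for a CNN.
    log_S = 10 * np.log10(Sxx + 1e-10)
    print(win, log_S.shape)      # (frequency bins, time frames)
```

The resulting matrix of frequency bins by time frames is the image-like representation that the CNN consumes; swapping the window string is the only change needed to compare the two framings.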


Acknowledgements

This research was funded by the European Regional Development Fund (ERDF) through the Regional Operational Program North 2020, within the scope of the project GreenHealth - Digital strategies in biological assets to improve well-being and promote green health (Norte-01-0145-FEDER-000042), and by the Foundation for Science and Technology (FCT, Portugal) through national funds FCT/MCTES (PIDDAC) to CeDRI (UIDB/05757/2020 and UIDP/05757/2020) and SusTEC (LA/P/0007/2021).

The authors are grateful for financial support from UTAD.

The authors would also like to thank João Mendes for his collaboration throughout the work.

Author information


Corresponding author

Correspondence to João P. Teixeira.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Teixeira, F.L., Soares, S.P., Abreu, J.P., Oliveira, P.M., Teixeira, J.P. (2024). Comparative Analysis of Windows for Speech Emotion Recognition Using CNN. In: Pereira, A.I., Mendes, A., Fernandes, F.P., Pacheco, M.F., Coelho, J.P., Lima, J. (eds) Optimization, Learning Algorithms and Applications. OL2A 2023. Communications in Computer and Information Science, vol 1981. Springer, Cham. https://doi.org/10.1007/978-3-031-53025-8_17

  • DOI: https://doi.org/10.1007/978-3-031-53025-8_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-53024-1

  • Online ISBN: 978-3-031-53025-8

  • eBook Packages: Computer Science, Computer Science (R0)
