Abstract
This paper compares the accuracy obtained in the Speech Emotion Recognition task when the Hamming and Hanning windows are used to frame the speech and compute the spectrogram that serves as input to a convolutional neural network. The detection of between 4 and 10 emotional states was tested for both windows. The results show significant differences in accuracy between the two window types and provide valuable insights for the development of more efficient emotional state detection systems. The best accuracies were 64.1% (4 emotions), 57.8% (5 emotions), 59.8% (6 emotions), 48.4% (7 emotions), 47.8% (8 emotions), 51.4% (9 emotions), and 45.9% (10 emotions). These accuracies are at the state-of-the-art level.
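To illustrate the windowing step described in the abstract, the following minimal sketch frames a signal with a Hamming and a Hann ("Hanning") window and computes a magnitude spectrogram for each. It is not the authors' implementation: the frame length, hop size, sample rate, and test tone are assumptions chosen for illustration, and a naive DFT stands in for the FFT a real pipeline would use.

```python
import cmath
import math


def hamming(n_samples):
    """Hamming window: 0.54 - 0.46*cos(2*pi*n/(N-1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def hann(n_samples):
    """Hann ("Hanning") window: 0.5 - 0.5*cos(2*pi*n/(N-1))."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def spectrogram(signal, window, hop):
    """Magnitude spectrogram: one list of |DFT| bins per frame."""
    size = len(window)
    frames = []
    for start in range(0, len(signal) - size + 1, hop):
        # Multiply the frame by the window before transforming it.
        frame = [signal[start + n] * window[n] for n in range(size)]
        # Naive DFT over the non-negative frequency bins (0 .. size/2).
        bins = []
        for k in range(size // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / size)
                      for n in range(size))
            bins.append(abs(acc))
        frames.append(bins)
    return frames


# Hypothetical test signal: a 1 kHz tone sampled at 8 kHz (not from the paper).
SAMPLE_RATE = 8000
FRAME_SIZE = 64   # assumed frame length in samples
HOP = 32          # assumed 50% overlap between frames
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(256)]

spec_hamming = spectrogram(tone, hamming(FRAME_SIZE), HOP)
spec_hann = spectrogram(tone, hann(FRAME_SIZE), HOP)
```

The two windows differ only in their coefficients: the Hann window tapers to zero at the frame edges, while the Hamming window keeps a small pedestal of 0.08, which changes how energy leaks into neighbouring frequency bins. Either spectrogram could then be fed to a classifier as a two-dimensional input.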
Acknowledgements
This research was funded by the European Regional Development Fund (ERDF) via the Regional Operational Program North 2020, GreenHealth-Digital strategies in biological assets to improve well-being and promote green health, Norte-01-0145-FEDER-000042; Foundation for Science and Technology (FCT, Portugal) support from national funds FCT/MCTES (PIDDAC) to CeDRI (UIDB/05757/2020 and UIDP/05757/2020) and SusTEC (LA/P/0007/2021).
The authors are grateful for financial support from UTAD.
The authors would also like to thank João Mendes for his collaboration throughout the work.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Teixeira, F.L., Soares, S.P., Abreu, J.P., Oliveira, P.M., Teixeira, J.P. (2024). Comparative Analysis of Windows for Speech Emotion Recognition Using CNN. In: Pereira, A.I., Mendes, A., Fernandes, F.P., Pacheco, M.F., Coelho, J.P., Lima, J. (eds) Optimization, Learning Algorithms and Applications. OL2A 2023. Communications in Computer and Information Science, vol 1981. Springer, Cham. https://doi.org/10.1007/978-3-031-53025-8_17
DOI: https://doi.org/10.1007/978-3-031-53025-8_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-53024-1
Online ISBN: 978-3-031-53025-8
eBook Packages: Computer Science (R0)