Abstract
In recent years, speech emotion recognition (SER) techniques have gained importance, mainly in human-computer interaction studies and applications. This research area poses several challenges, including the development of new and efficient detection methods, effective extraction of audio features, and time-based preprocessing strategies. This paper proposes a new multiview model to detect speech emotion from raw audio data. The proposed method uses optimized mel-spectrogram features extracted from the audio files and combines deep learning algorithms to improve detection performance. The combination relies on the following algorithms: CNN (Convolutional Neural Network), VGG (Visual Geometry Group), ResNet (Residual Neural Network), and LSTM (Long Short-Term Memory). The CNN extracts the characteristics present in the mel-spectrogram images applied as input to the method. These characteristics are combined with the features produced by the pre-trained VGG and ResNet networks. Finally, the LSTM receives the combined representation and identifies the predefined emotions. The proposed method was developed using the RAVDESS database, considering eight emotions. The results show an increase of up to 12% in accuracy compared to strategies in the literature based on raw data processing.
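The pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the segment count, spectrogram size, layer widths, frozen ImageNet-pretrained backbones, and the use of Keras' TimeDistributed wrapper are assumptions made here for concreteness; only the overall flow (mel-spectrograms, CNN/VGG/ResNet views, LSTM, eight-class output) follows the abstract.

```python
# Hedged sketch of a time-distributed multiview SER model.
# Assumed values: N_SEGMENTS, INPUT_SHAPE, layer sizes, ImageNet weights.
import numpy as np
import librosa
import tensorflow as tf
from tensorflow.keras import layers, models, applications

def mel_spectrogram(path, sr=22050, n_mels=128):
    """Load an audio file and return a dB-scaled mel-spectrogram."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)

N_SEGMENTS = 5               # assumed number of time-distributed spectrogram segments
INPUT_SHAPE = (128, 128, 3)  # assumed segment size (mel bins x frames x RGB channels)
N_EMOTIONS = 8               # RAVDESS: neutral, calm, happy, sad, angry, fearful, disgust, surprised

def build_multiview_model():
    inputs = layers.Input(shape=(N_SEGMENTS,) + INPUT_SHAPE)

    # View 1: small CNN trained from scratch on the spectrogram segments.
    cnn = models.Sequential([
        layers.Conv2D(32, 3, activation="relu", input_shape=INPUT_SHAPE),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.GlobalAveragePooling2D(),
    ])

    # Views 2 and 3: frozen pre-trained VGG16 and ResNet50 feature extractors.
    vgg = applications.VGG16(include_top=False, weights="imagenet",
                             input_shape=INPUT_SHAPE, pooling="avg")
    resnet = applications.ResNet50(include_top=False, weights="imagenet",
                                   input_shape=INPUT_SHAPE, pooling="avg")
    vgg.trainable = False
    resnet.trainable = False

    # Apply each view to every time segment and concatenate the resulting features.
    f_cnn = layers.TimeDistributed(cnn)(inputs)
    f_vgg = layers.TimeDistributed(vgg)(inputs)
    f_res = layers.TimeDistributed(resnet)(inputs)
    fused = layers.Concatenate()([f_cnn, f_vgg, f_res])

    # LSTM over the time-distributed multiview features, then emotion classification.
    x = layers.LSTM(128)(fused)
    outputs = layers.Dense(N_EMOTIONS, activation="softmax")(x)

    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Under these assumptions, training would consist of slicing each RAVDESS clip into N_SEGMENTS spectrogram windows, stacking them as the model input, and fitting against one-hot emotion labels.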
Acknowledgements
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001. This work was partially supported by Conselho Nacional de Desenvolvimento Científico e Tecnológico - CNPq (Proc. 311065/2020-1).
Copyright information
© 2024 Springer Nature Switzerland AG
About this paper
Cite this paper
Letícia de Mattos, F., Pellenz, M.E., Britto, A.d.S. (2024). Time Distributed Multiview Representation for Speech Emotion Recognition. In: Vasconcelos, V., Domingues, I., Paredes, S. (eds) Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. CIARP 2023. Lecture Notes in Computer Science, vol 14469. Springer, Cham. https://doi.org/10.1007/978-3-031-49018-7_11
DOI: https://doi.org/10.1007/978-3-031-49018-7_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-49017-0
Online ISBN: 978-3-031-49018-7
eBook Packages: Computer Science, Computer Science (R0)