Abstract
In recent years, speech emotion recognition (SER) techniques have gained importance, mainly in human-computer interaction studies and applications. This research area poses several challenges, including the development of new and efficient detection methods, effective extraction of audio features, and time-based preprocessing strategies. This paper proposes a new multiview model to detect speech emotion from raw audio data. The proposed method uses optimized mel-spectrogram features extracted from the audio files and combines deep learning algorithms to improve detection performance. The combination relies on the following algorithms: CNN (Convolutional Neural Network), VGG (Visual Geometry Group), ResNet (Residual Neural Network), and LSTM (Long Short-Term Memory). The CNN extracts the characteristics present in the mel-spectrogram images applied as input to the method. These characteristics are combined with the features produced by the pre-trained VGG and ResNet networks. Finally, the LSTM receives the combined representation and identifies the predefined emotions. The proposed method was developed using the RAVDESS database, considering eight emotions. The results show an increase of up to 12% in accuracy compared to strategies in the literature based on raw data processing.
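The pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the segment count, spectrogram size, layer widths, frozen ImageNet-pretrained backbones, and the use of Keras' TimeDistributed wrapper are assumptions made here for concreteness; only the overall flow (mel-spectrograms, CNN/VGG/ResNet views, LSTM, eight-class output) follows the abstract.

```python
# Hedged sketch of a time-distributed multiview SER model.
# Assumed values: N_SEGMENTS, INPUT_SHAPE, layer sizes, ImageNet weights.
import numpy as np
import librosa
import tensorflow as tf
from tensorflow.keras import layers, models, applications

def mel_spectrogram(path, sr=22050, n_mels=128):
    """Load an audio file and return a dB-scaled mel-spectrogram."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)

N_SEGMENTS = 5               # assumed number of time-distributed spectrogram segments
INPUT_SHAPE = (128, 128, 3)  # assumed segment size (mel bins x frames x RGB channels)
N_EMOTIONS = 8               # RAVDESS: neutral, calm, happy, sad, angry, fearful, disgust, surprised

def build_multiview_model():
    inputs = layers.Input(shape=(N_SEGMENTS,) + INPUT_SHAPE)

    # View 1: small CNN trained from scratch on the spectrogram segments.
    cnn = models.Sequential([
        layers.Conv2D(32, 3, activation="relu", input_shape=INPUT_SHAPE),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.GlobalAveragePooling2D(),
    ])

    # Views 2 and 3: frozen pre-trained VGG16 and ResNet50 feature extractors.
    vgg = applications.VGG16(include_top=False, weights="imagenet",
                             input_shape=INPUT_SHAPE, pooling="avg")
    resnet = applications.ResNet50(include_top=False, weights="imagenet",
                                   input_shape=INPUT_SHAPE, pooling="avg")
    vgg.trainable = False
    resnet.trainable = False

    # Apply each view to every time segment and concatenate the resulting features.
    f_cnn = layers.TimeDistributed(cnn)(inputs)
    f_vgg = layers.TimeDistributed(vgg)(inputs)
    f_res = layers.TimeDistributed(resnet)(inputs)
    fused = layers.Concatenate()([f_cnn, f_vgg, f_res])

    # LSTM over the time-distributed multiview features, then emotion classification.
    x = layers.LSTM(128)(fused)
    outputs = layers.Dense(N_EMOTIONS, activation="softmax")(x)

    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Under these assumptions, training would consist of slicing each RAVDESS clip into N_SEGMENTS spectrogram windows, stacking them as the model input, and fitting against one-hot emotion labels.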
Acknowledgements
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001. This work was partially supported by Conselho Nacional de Desenvolvimento Científico e Tecnológico - CNPq (Proc. 311065/2020-1).
Copyright information
© 2024 Springer Nature Switzerland AG
About this paper
Cite this paper
Letícia de Mattos, F., Pellenz, M.E., Britto, A.d.S. (2024). Time Distributed Multiview Representation for Speech Emotion Recognition. In: Vasconcelos, V., Domingues, I., Paredes, S. (eds) Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. CIARP 2023. Lecture Notes in Computer Science, vol 14469. Springer, Cham. https://doi.org/10.1007/978-3-031-49018-7_11
DOI: https://doi.org/10.1007/978-3-031-49018-7_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-49017-0
Online ISBN: 978-3-031-49018-7
eBook Packages: Computer Science, Computer Science (R0)