Abstract
Speech Emotion Recognition (SER) is crucial for enhancing human-machine voice interactions, allowing systems to better interpret the speaker’s emotional state and improve user experience. Self-supervised learning (SSL) models have significantly advanced speech processing by learning speech representations from large unlabeled corpora, which can then be fine-tuned with smaller labeled datasets for downstream tasks such as voice command detection, automatic transcription, speaker identification, and SER. However, SSL models are typically optimized for general tasks rather than emotion recognition. This poses a challenge, as the limited labeled data in SER can hinder the generalization capabilities of these models, making the choice of SSL architecture and training strategy vital. The most common procedure in SER is to average the SSL features over time (average time pooling) before training classification models, which discards temporal relationships in the data. In this study, we introduce two novel SSL feature aggregation methods that leverage attention mechanisms to better capture temporal dependencies in speech data. These methods improve the extraction of relevant information from SSL features, leading to gains in classification accuracy. Our proposed approach outperforms the standard average time pooling method, achieving up to a 6.3% increase in weighted accuracy (WA) on the IEMOCAP database.
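To make the contrast concrete, the sketch below (PyTorch, not taken from the paper) compares the standard average-time-pooling baseline with one plausible attention-based aggregator over frame-level SSL features. The AttentionPooling module, its single learned query, the head count, and the feature dimensions are illustrative assumptions for this sketch, not the authors' exact proposed architectures.

import torch
import torch.nn as nn

class MeanTimePooling(nn.Module):
    # Baseline: average the SSL features over the time axis (temporal order is lost).
    def forward(self, feats):                    # feats: (batch, time, dim)
        return feats.mean(dim=1)                 # -> (batch, dim)

class AttentionPooling(nn.Module):
    # Illustrative attention aggregator: a learned query attends over all frames,
    # so emotion-relevant frames can receive higher weight than a uniform average.
    def __init__(self, dim, n_heads=4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, feats):                    # feats: (batch, time, dim)
        q = self.query.expand(feats.size(0), -1, -1)
        pooled, _ = self.attn(q, feats, feats)   # -> (batch, 1, dim)
        return pooled.squeeze(1)                 # -> (batch, dim)

# Toy usage: 8 utterances, 100 hypothetical SSL frames of dimension 768 each.
feats = torch.randn(8, 100, 768)
classifier = nn.Linear(768, 4)                   # e.g. 4 emotion classes, as in common IEMOCAP setups
logits_avg  = classifier(MeanTimePooling()(feats))
logits_attn = classifier(AttentionPooling(768)(feats))

Either pooled vector can feed a lightweight classifier; the point of the attention variant is that the weighting over frames is learned jointly with the emotion labels instead of being fixed to a uniform average.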
References
Berg, A., O’Connor, M., Cruz, M.T.: Keyword Transformer: A Self-Attention Model for Keyword Spotting (2021). arXiv preprint arXiv:2104.00769
Busso, C., et al.: IEMOCAP: interactive emotional dyadic motion capture database. Lang. Resour. Eval. 42, 335–359 (2008)
Chen, S., et al.: WavLM: large-scale self-supervised pre-training for full stack speech processing. IEEE J. Sel. Top. Signal Process. 16(6), 1505–1518 (2022). https://doi.org/10.1109/jstsp.2022.3188113
Chen, S., et al.: UniSpeech-SAT: universal speech representation learning with speaker aware pre-training. In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6152–6156 (2022). https://doi.org/10.1109/ICASSP43922.2022.9747077
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2019)
Dong, L., Xu, S., Xu, B.: Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5884–5888 (2018). https://doi.org/10.1109/ICASSP.2018.8462506
Dosovitskiy, A., et al.: An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale (2021)
Hsu, W.N., Bolte, B., Tsai, Y.H.H., Lakhotia, K., Salakhutdinov, R., Mohamed, A.: HuBERT: Self-supervised Speech Representation Learning by Masked Prediction of Hidden Units (2021)
Jiang, D., et al.: A further study of unsupervised pretraining for transformer based speech recognition. In: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6538–6542 (2021). https://doi.org/10.1109/ICASSP39728.2021.9414539
Latif, S., Zaidi, A., Cuayahuitl, H., Shamshad, F., Shoukat, M., Qadir, J.: Transformers in Speech Processing: A Survey (2023)
Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (2023)
Novoselov, S., Lavrentyeva, G., Avdeeva, A., Volokhov, V., Gusev, A.: Robust Speaker Recognition with Transformers Using wav2vec 2.0 (2022)
Shao, Z., et al.: TransMIL: transformer-based correlated multiple instance learning for whole slide image classification. Adv. Neural. Inf. Process. Syst. 34, 2136–2147 (2021)
Wang, Y., et al.: Speech Swin-Transformer: Exploring a Hierarchical Transformer with Shifted Windows for Speech Emotion Recognition (2024)
Acknowledgments
This work was supported by the Spanish Ministry of Science and Innovation through the DIPSY project (TED2021-131401B-C21), by the European Union’s Horizon 2020 project “HELIOS” (No. 825585), by the Universitat Politècnica de València (PAID-10-20), by the Generalitat Valenciana (ACIF/2021/187 and PROMETEO/2020/024), and by the Spanish Government (BEWORD, PID2021-126061OB-C41).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Valls, O. et al. (2025). Improving Speech Emotion Recognition: Novel Aggregation Strategies for Self-supervised Features. In: Julian, V., et al. Intelligent Data Engineering and Automated Learning – IDEAL 2024. IDEAL 2024. Lecture Notes in Computer Science, vol 15346. Springer, Cham. https://doi.org/10.1007/978-3-031-77731-8_35
DOI: https://doi.org/10.1007/978-3-031-77731-8_35
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-77730-1
Online ISBN: 978-3-031-77731-8