Abstract
Speech Emotion Recognition (SER) is crucial for enhancing human-machine voice interactions, allowing systems to better interpret the speaker’s emotional state and improve user experience. Self-supervised learning (SSL) models have significantly advanced speech processing by learning speech representations from large unlabeled corpora, which can then be fine-tuned with smaller labeled datasets for downstream tasks such as voice command detection, automatic transcription, speaker identification, and SER. However, SSL models are typically optimized for general tasks rather than emotion recognition. This poses a challenge, as the limited labeled data in SER can hinder the generalization capabilities of these models, making the choice of SSL architecture and training strategy vital. The most common procedure in SER is to average the SSL features over time (average time pooling) before training classification models, which discards temporal relationships in the data. In this study, we introduce two novel SSL feature aggregation methods that leverage attention mechanisms to better capture temporal dependencies in speech data. These methods improve the extraction of relevant information from SSL features, leading to gains in classification accuracy. Our proposed approach outperforms the standard average time pooling method, achieving up to a 6.3% increase in weighted accuracy (WA) on the IEMOCAP database.
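To make the contrast concrete, the sketch below (PyTorch, not taken from the paper) compares the standard average-time-pooling baseline with one plausible attention-based aggregator over frame-level SSL features. The AttentionPooling module, its single learned query, the head count, and the feature dimensions are illustrative assumptions for this sketch, not the authors' exact proposed architectures.

import torch
import torch.nn as nn

class MeanTimePooling(nn.Module):
    # Baseline: average the SSL features over the time axis (temporal order is lost).
    def forward(self, feats):                    # feats: (batch, time, dim)
        return feats.mean(dim=1)                 # -> (batch, dim)

class AttentionPooling(nn.Module):
    # Illustrative attention aggregator: a learned query attends over all frames,
    # so emotion-relevant frames can receive higher weight than a uniform average.
    def __init__(self, dim, n_heads=4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, feats):                    # feats: (batch, time, dim)
        q = self.query.expand(feats.size(0), -1, -1)
        pooled, _ = self.attn(q, feats, feats)   # -> (batch, 1, dim)
        return pooled.squeeze(1)                 # -> (batch, dim)

# Toy usage: 8 utterances, 100 hypothetical SSL frames of dimension 768 each.
feats = torch.randn(8, 100, 768)
classifier = nn.Linear(768, 4)                   # e.g. 4 emotion classes, as in common IEMOCAP setups
logits_avg  = classifier(MeanTimePooling()(feats))
logits_attn = classifier(AttentionPooling(768)(feats))

Either pooled vector can feed a lightweight classifier; the point of the attention variant is that the weighting over frames is learned jointly with the emotion labels instead of being fixed to a uniform average.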
References
Berg, A., O’Connor, M., Cruz, M.T.: Keyword Transformer: A Self-Attention Model for Keyword Spotting (2021). arXiv preprint arXiv:2104.00769
Busso, C., et al.: IEMOCAP: interactive emotional dyadic motion capture database. Lang. Resour. Eval. 42, 335–359 (2008)
Chen, S., et al.: WavLM: large-scale self-supervised pre-training for full stack speech processing. IEEE J. Sel. Top. Signal Process. 16(6), 1505–1518 (2022). https://doi.org/10.1109/jstsp.2022.3188113
Chen, S., et al.: UniSpeech-SAT: universal speech representation learning with speaker aware pre-training. In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6152–6156 (2022). https://doi.org/10.1109/ICASSP43922.2022.9747077
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2019)
Dong, L., Xu, S., Xu, B.: Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5884–5888 (2018). https://doi.org/10.1109/ICASSP.2018.8462506
Dosovitskiy, A., et al.: An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale (2021)
Hsu, W.N., Bolte, B., Tsai, Y.H.H., Lakhotia, K., Salakhutdinov, R., Mohamed, A.: HuBERT: Self-supervised Speech Representation Learning by Masked Prediction of Hidden Units (2021)
Jiang, D., et al.: A further study of unsupervised pretraining for transformer based speech recognition. In: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6538–6542 (2021). https://doi.org/10.1109/ICASSP39728.2021.9414539
Latif, S., Zaidi, A., Cuayahuitl, H., Shamshad, F., Shoukat, M., Qadir, J.: Transformers in Speech Processing: A Survey (2023)
Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (2023)
Novoselov, S., Lavrentyeva, G., Avdeeva, A., Volokhov, V., Gusev, A.: Robust Speaker Recognition with Transformers Using wav2vec 2.0 (2022)
Shao, Z., et al.: TransMIL: transformer-based correlated multiple instance learning for whole slide image classification. Adv. Neural. Inf. Process. Syst. 34, 2136–2147 (2021)
Wang, Y., et al.: Speech Swin-Transformer: Exploring a Hierarchical Transformer with Shifted Windows for Speech Emotion Recognition (2024)
Acknowledgments
This work was supported by the Spanish Ministry of Science and Innovation through the DIPSY project (TED2021-131401B-C21), by the European Union’s Horizon 2020 project “HELIOS” (No. 825585), by the Universitat Politècnica de València (PAID-10-20), by the Generalitat Valenciana (ACIF/2021/187 and PROMETEO/2020/024), and by the Spanish Government (BEWORD, PID2021-126061OB-C41).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Valls, O. et al. (2025). Improving Speech Emotion Recognition: Novel Aggregation Strategies for Self-supervised Features. In: Julian, V., et al. Intelligent Data Engineering and Automated Learning – IDEAL 2024. IDEAL 2024. Lecture Notes in Computer Science, vol 15346. Springer, Cham. https://doi.org/10.1007/978-3-031-77731-8_35
DOI: https://doi.org/10.1007/978-3-031-77731-8_35
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-77730-1
Online ISBN: 978-3-031-77731-8