Improving Speech Emotion Recognition: Novel Aggregation Strategies for Self-supervised Features

  • Conference paper
  • In: Intelligent Data Engineering and Automated Learning – IDEAL 2024 (IDEAL 2024)

Abstract

Speech Emotion Recognition (SER) is crucial for enhancing human-machine voice interaction, allowing systems to better interpret the speaker's emotional state and improve the user experience. Self-supervised learning (SSL) models have significantly advanced speech processing by learning speech representations from large unlabeled corpora; these representations can then be fine-tuned with smaller labeled datasets for downstream tasks such as voice command detection, automatic transcription, speaker identification, and SER. However, SSL models are typically optimized for general tasks rather than emotion recognition. This poses a challenge, as the limited labeled data available for SER can hinder the generalization capabilities of these models, making the choice of SSL architecture and training strategy vital. The most common procedure in SER is to average-pool the SSL features over time before training classification models, which often discards temporal relationships in the data. In this study, we introduce two novel SSL feature aggregation methods that leverage attention mechanisms to better capture temporal dependencies in speech. These methods significantly enhance the extraction of relevant information from SSL features, leading to improvements in classification accuracy. Our proposed approach outperforms the standard average time pooling method, achieving up to a 6.3% increase in weighted accuracy (WA) on the IEMOCAP database.
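To make the contrast in the abstract concrete, below is a minimal PyTorch sketch: MeanPoolClassifier implements the standard average-time-pooling baseline over frame-level SSL features, while AttnPoolClassifier shows one plausible attention-based aggregation (a single learned query attending over all frames). This is an illustration under stated assumptions, not the paper's exact architecture; the class names, dimensions, and single-query attention layout are hypothetical.

# Illustrative sketch (not the paper's architecture): average time pooling
# vs. a simple attention-based aggregation of frame-level SSL features.
# Input shape is assumed to be (batch, frames, feat_dim), e.g. embeddings
# from an SSL encoder such as WavLM or HuBERT.
import torch
import torch.nn as nn

class MeanPoolClassifier(nn.Module):
    """Baseline: average SSL features over time, then classify."""
    def __init__(self, feat_dim: int, n_classes: int):
        super().__init__()
        self.head = nn.Linear(feat_dim, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, feat_dim) -> (batch, feat_dim)
        return self.head(x.mean(dim=1))

class AttnPoolClassifier(nn.Module):
    """One possible attention aggregation: a learned query attends over
    frames, so emotionally informative frames can receive higher weight
    than under uniform averaging."""
    def __init__(self, feat_dim: int, n_classes: int, n_heads: int = 4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.head = nn.Linear(feat_dim, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.query.expand(x.size(0), -1, -1)   # (batch, 1, feat_dim)
        pooled, _ = self.attn(q, x, x)             # attend over all frames
        return self.head(pooled.squeeze(1))

# Usage on dummy SSL features: batch of 2, 300 frames, 768 dims, 4 emotions.
feats = torch.randn(2, 300, 768)
print(MeanPoolClassifier(768, 4)(feats).shape)  # torch.Size([2, 4])
print(AttnPoolClassifier(768, 4)(feats).shape)  # torch.Size([2, 4])

The design point is that mean pooling weights every frame equally, whereas the attention variant learns a data-dependent weighting over time, which is the kind of temporal sensitivity the paper's aggregation strategies aim to exploit.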

Acknowledgments

This work was supported by the Spanish Ministry of Science and Innovation through the DIPSY project (TED2021-131401B-C21), by the European Union's Horizon 2020-funded project "HELIOS" (No. 825585), by the Universitat Politècnica de València (PAID-10-20), by the Generalitat Valenciana (ACIF/2021/187 and PROMETEO/2020/024), and by the Spanish Government (BEWORD, PID2021-126061OB-C41).

Author information

Corresponding author

Correspondence to Fran Pastor-Naranjo.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Valls, O. et al. (2025). Improving Speech Emotion Recognition: Novel Aggregation Strategies for Self-supervised Features. In: Julian, V., et al. Intelligent Data Engineering and Automated Learning – IDEAL 2024. IDEAL 2024. Lecture Notes in Computer Science, vol 15346. Springer, Cham. https://doi.org/10.1007/978-3-031-77731-8_35

  • DOI: https://doi.org/10.1007/978-3-031-77731-8_35

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-77730-1

  • Online ISBN: 978-3-031-77731-8

  • eBook Packages: Computer Science, Computer Science (R0)
