
Aggregation Strategies of Wav2vec 2.0 Embeddings for Computational Paralinguistic Tasks

  • Conference paper
Speech and Computer (SPECOM 2023)

Abstract

Throughout the history of computational paralinguistics, numerous feature extraction, preprocessing and classification techniques have been used. One important challenge in this subfield of speech technology is handling utterances of different durations. Since standard speech processing features (such as filter banks or DNN embeddings) are typically frame-level, while we would like to classify whole utterances, the set of frame-level features has to be converted into fixed-size utterance-level features. The choice of this aggregation method is often overlooked, and simple functions such as the mean and/or standard deviation are used without solid experimental support. In this study we take wav2vec 2.0 deep embeddings and aggregate them with 11 different functions. We sought to identify a subset of potentially optimal aggregation functions, as there are not yet general rules that apply universally across subtopics. Besides testing both standard and non-traditional aggregation strategies individually, we also combined them to improve classification performance. By using multiple aggregation functions, we achieved significant improvements on three public paralinguistic corpora.
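The core idea of the abstract can be illustrated with a minimal sketch: a wav2vec 2.0 encoder produces one embedding vector per frame, and several aggregation functions applied over the time axis, concatenated together, yield a single fixed-size utterance-level vector. The function names, the choice of mean/std/max, and the dimensions below are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def aggregate_embeddings(frames, funcs=(np.mean, np.std, np.max)):
    """Collapse a (T, D) matrix of frame-level embeddings into one
    fixed-size utterance-level vector: each aggregation function is
    applied over the time axis (axis=0) and the resulting D-dimensional
    vectors are concatenated, so the output size is len(funcs) * D,
    independent of the utterance length T."""
    return np.concatenate([f(frames, axis=0) for f in funcs])

# Example: an utterance of 120 frames with 768-dimensional embeddings
# (768 is the hidden size of the base wav2vec 2.0 model).
frames = np.random.randn(120, 768)
utterance_vector = aggregate_embeddings(frames)
# Three functions x 768 dimensions -> a 2304-dimensional vector,
# regardless of how many frames the utterance contained.
```

Because the output dimensionality no longer depends on the utterance length, such vectors can be fed directly to a fixed-input classifier such as an SVM.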



Acknowledgements

This research was supported by the NRDI Office of the Hungarian Ministry of Innovation and Technology (grant no. TKP2021-NVA-09), and within the framework of the Artificial Intelligence National Laboratory Program (RRF-2.3.1-21-2022-00004).

Author information

Corresponding author: Mercedes Vetráb.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Vetráb, M., Gosztolya, G. (2023). Aggregation Strategies of Wav2vec 2.0 Embeddings for Computational Paralinguistic Tasks. In: Karpov, A., Samudravijaya, K., Deepak, K.T., Hegde, R.M., Agrawal, S.S., Prasanna, S.R.M. (eds) Speech and Computer. SPECOM 2023. Lecture Notes in Computer Science(), vol 14338. Springer, Cham. https://doi.org/10.1007/978-3-031-48309-7_7

  • DOI: https://doi.org/10.1007/978-3-031-48309-7_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-48308-0

  • Online ISBN: 978-3-031-48309-7
