DOI: 10.1145/3503161.3551606

research-article

Audio Features from the Wav2Vec 2.0 Embeddings for the ACM Multimedia 2022 Stuttering Challenge

Published: 10 October 2022

ABSTRACT

The ACM Multimedia 2022 Stuttering Challenge is to determine the stuttering-related class of a speech segment, among seven stuttering-related classes and an eighth garbage class. For this purpose, we investigated the Wav2Vec 2.0 deep neural network to extract audio features from its Transformer embeddings. Experiments were conducted on a part of the Kassel State of Fluency Corpus (KSoF). First, we introduced 21 functionals used to define two large composite audio-feature sets: the W2V2 Basic audio-feature set (193,536 features), built from the Base version of the Transformer embeddings, and the W2V2 Large audio-feature set (516,096 features), built from the Large version. Some functionals estimate spatial variability (e.g., mean, standard deviation, quartiles) and others temporal variability (e.g., linear-regression slope). Each composite set was then split into smaller audio-feature sets by grouping features by functional and by layer, and the most discriminant of these sets were selected. Finally, two audio-feature sets specializing in stuttered speech were developed and assessed: the W2V2 Advanced audio-feature set (9,984 features) and the W2V2 Large Advanced audio-feature set (15,360 features). On the Test set, they improved on the official Challenge baseline (40.4%) by 9.3% and 11.9%, respectively.
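The feature-extraction step described above (functionals computed over time, per embedding dimension and per Transformer layer, hence 21 × 12 × 768 = 193,536 features for the Base model and 21 × 24 × 1,024 = 516,096 for the Large model) can be sketched as follows. This is an illustrative NumPy sketch, not the authors' code: the embedding array and the subset of functionals shown (mean, standard deviation, quartiles, linear-regression slope) are assumptions for demonstration.

```python
import numpy as np

def layer_functionals(emb):
    """Compute simple functionals over time for one Transformer layer.

    emb: (T, D) array of wav2vec 2.0 frame embeddings for one layer.
    Returns a 1-D feature vector of length n_functionals * D.
    The paper uses 21 functionals; this sketch shows a representative
    subset: mean, standard deviation, quartiles, and regression slope.
    """
    t = np.arange(emb.shape[0])
    mean = emb.mean(axis=0)                              # spatial variability
    std = emb.std(axis=0)
    q1, q2, q3 = np.percentile(emb, [25, 50, 75], axis=0)
    # temporal variability: per-dimension slope of a least-squares
    # line fitted over the frame index
    slope = np.polyfit(t, emb, 1)[0]
    return np.concatenate([mean, std, q1, q2, q3, slope])

# toy example: a 150-frame segment with Base-sized embeddings (768 dims)
rng = np.random.default_rng(0)
emb = rng.standard_normal((150, 768))
feats = layer_functionals(emb)
print(feats.shape)  # (4608,) = 6 functionals * 768 dims
```

Concatenating such a vector across all Transformer layers yields the composite feature set; with the full 21 functionals and all 12 Base layers this gives the 193,536-dimensional W2V2 Basic set described in the abstract.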

Supplemental Material

mmgc60-montacie.mp4 (mp4, 96.1 MB)


Published in

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022, 7537 pages
ISBN: 9781450392037
DOI: 10.1145/3503161

Copyright © 2022 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States




        Acceptance Rates

Overall acceptance rate: 995 of 4,171 submissions, 24%

