ABSTRACT
The task of the ACM Multimedia 2022 Stuttering Challenge is to determine the stuttering-related class of a speech segment: there are seven stuttering-related classes and an eighth garbage class. For this purpose, we investigated the Wav2Vec 2.0 deep neural network as a source of audio features extracted from its Transformer embeddings. Experiments were conducted on a part of the Kassel State of Fluency Corpus (KSoF). First, we introduced 21 functionals used to build two large composite audio-feature sets: the W2V2 Basic audio-feature set (193,536 features), computed from the embeddings of the Base version of the Transformer, and the W2V2 Large audio-feature set (516,096 features), computed from the embeddings of the Large version. Some functionals estimate spatial variability (e.g., mean, standard deviation, quartiles) and others temporal variability (e.g., linear-regression slope). Each composite set was then split into smaller audio-feature sets by grouping features by functional and by layer, and the most discriminant of these sets were selected. Finally, two audio-feature sets specialized for stuttered speech were developed and assessed: the W2V2 Advanced audio-feature set (9,984 features) and the W2V2 Large Advanced audio-feature set (15,360 features). Experiments showed improvements of 9.3% for the first and 11.9% for the second on the Test set over the official Challenge baseline (40.4%).
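The feature construction described above can be sketched as follows. This is a minimal, illustrative numpy example, not the authors' implementation: it applies six of the functionals named in the abstract (mean, standard deviation, three quartiles, linear-regression slope) over the time axis of per-layer frame embeddings and concatenates the results across layers; the array shapes (12 Transformer layers, 768-dimensional Base embeddings) and function names are assumptions, and the actual study uses 21 functionals.

```python
import numpy as np

def functionals(layer_emb: np.ndarray) -> np.ndarray:
    """Apply illustrative functionals over time to one layer's embeddings.

    layer_emb: (num_frames, hidden_dim) array of per-frame embeddings.
    Returns one feature vector: mean, std, quartiles (spatial variability)
    and the per-dimension linear-regression slope (temporal variability).
    """
    mean = layer_emb.mean(axis=0)
    std = layer_emb.std(axis=0)
    q1, q2, q3 = np.percentile(layer_emb, [25, 50, 75], axis=0)
    # Least-squares slope of each embedding dimension vs. the frame index.
    t = np.arange(layer_emb.shape[0])
    t_c = t - t.mean()
    slope = (t_c[:, None] * (layer_emb - mean)).sum(axis=0) / (t_c ** 2).sum()
    return np.concatenate([mean, std, q1, q2, q3, slope])

def composite_features(embeddings: np.ndarray) -> np.ndarray:
    """Concatenate the functionals of every layer into one composite vector.

    embeddings: (num_layers, num_frames, hidden_dim), e.g. the stacked
    hidden states of a Wav2Vec 2.0 model for one speech segment.
    """
    return np.concatenate([functionals(layer) for layer in embeddings])

# Toy segment: 12 Transformer layers, 50 frames, 768 dimensions (Base).
emb = np.random.default_rng(0).normal(size=(12, 50, 768))
feats = composite_features(emb)
print(feats.shape)  # (55296,) = 12 layers x 6 functionals x 768 dims
```

With the full 21 functionals of the paper, the same 12-layer x 768-dimension layout yields the 193,536 features of the W2V2 Basic set, and grouping the composite vector by functional and by layer recovers the smaller sets from which the Advanced sets are selected.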