ABSTRACT
The task of the ACM Multimedia 2022 Stuttering Challenge is to determine the stuttering-related class of a speech segment: there are seven stuttering-related classes and an eighth garbage class. For this purpose, we investigated the Wav2Vec 2.0 deep neural network as a source of audio features extracted from its Transformer embeddings. Experiments were conducted on a part of the Kassel State of Fluency Corpus (KSoF). First, we introduced 21 functionals used to build two large composite audio-feature sets: the W2V2 Basic audio-feature set (193,536 features), computed from the embeddings of the Base version of the Transformer, and the W2V2 Large audio-feature set (516,096 features), computed from the embeddings of the Large version. Some functionals estimate spatial variability (e.g., mean, standard deviation, quartiles) and others temporal variability (e.g., linear-regression slope). Each composite set was then split into smaller audio-feature sets by grouping features by functional and by layer, and the most discriminant of these sets were selected. Finally, two audio-feature sets specialized for stuttered speech were developed and assessed: the W2V2 Advanced audio-feature set (9,984 features) and the W2V2 Large Advanced audio-feature set (15,360 features). Experiments showed improvements of 9.3% for the first and 11.9% for the second on the Test set over the official Challenge baseline (40.4%).
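The feature construction described above can be sketched as follows. This is a minimal, illustrative numpy example, not the authors' implementation: it applies six of the functionals named in the abstract (mean, standard deviation, three quartiles, linear-regression slope) over the time axis of per-layer frame embeddings and concatenates the results across layers; the array shapes (12 Transformer layers, 768-dimensional Base embeddings) and function names are assumptions, and the actual study uses 21 functionals.

```python
import numpy as np

def functionals(layer_emb: np.ndarray) -> np.ndarray:
    """Apply illustrative functionals over time to one layer's embeddings.

    layer_emb: (num_frames, hidden_dim) array of per-frame embeddings.
    Returns one feature vector: mean, std, quartiles (spatial variability)
    and the per-dimension linear-regression slope (temporal variability).
    """
    mean = layer_emb.mean(axis=0)
    std = layer_emb.std(axis=0)
    q1, q2, q3 = np.percentile(layer_emb, [25, 50, 75], axis=0)
    # Least-squares slope of each embedding dimension vs. the frame index.
    t = np.arange(layer_emb.shape[0])
    t_c = t - t.mean()
    slope = (t_c[:, None] * (layer_emb - mean)).sum(axis=0) / (t_c ** 2).sum()
    return np.concatenate([mean, std, q1, q2, q3, slope])

def composite_features(embeddings: np.ndarray) -> np.ndarray:
    """Concatenate the functionals of every layer into one composite vector.

    embeddings: (num_layers, num_frames, hidden_dim), e.g. the stacked
    hidden states of a Wav2Vec 2.0 model for one speech segment.
    """
    return np.concatenate([functionals(layer) for layer in embeddings])

# Toy segment: 12 Transformer layers, 50 frames, 768 dimensions (Base).
emb = np.random.default_rng(0).normal(size=(12, 50, 768))
feats = composite_features(emb)
print(feats.shape)  # (55296,) = 12 layers x 6 functionals x 768 dims
```

With the full 21 functionals of the paper, the same 12-layer x 768-dimension layout yields the 193,536 features of the W2V2 Basic set, and grouping the composite vector by functional and by layer recovers the smaller sets from which the Advanced sets are selected.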