Explainable Stuttering Recognition Using Axial Attention

Ma, Yu; Huang, Yuting; Yuan, Kaixiang; Xuan, Guangzhe; Yu, Yongzi; Zhong, Hengrui; Li, Rui; Shen, Jian; Qian, Kun; Hu, Bin; Schuller, Björn W.; Yamamoto, Yoshiharu

doi:10.1007/978-981-99-4749-2_18

Yu Ma¹³,
Yuting Huang¹³,
Kaixiang Yuan¹³,
Guangzhe Xuan¹³,
Yongzi Yu¹³,
Hengrui Zhong¹³,
Rui Li¹⁴,
Jian Shen¹³,
Kun Qian¹³,
Bin Hu¹³,
Björn W. Schuller^15,16 &
…
Yoshiharu Yamamoto¹⁷

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 14088))

Included in the following conference series:

International Conference on Intelligent Computing

802 Accesses

Abstract

Stuttering is a complex speech disorder that disrupts the flow of speech, and recognizing persons who stutter (PWS) and understanding their significant struggles is crucial. With advancements in computer vision, deep neural networks offer potential for recognizing stuttering events through image-based features. In this paper, we extract image features of Wavelet Transformation (WT) and Histograms of Oriented Gradient (HOG) from audio signals. We also generate explainable images using Gradient-weighted Class Activation Mapping (Grad-CAM) as input for our final recognition model–an axial attention-based EfficientNetV2, which is trained on the Kassel State of Fluency Dataset (KSoF) to perform 8 classes recognition. Our experimental results achieved a relative percentage increase in unweighted average recall (UAR) of 4.4% compared to the baseline of ComParE 2022, demonstrating that the axial attention-based EfficientNetV2, combined with the explainable input, has the capability to detect and recognise multiple types of stuttering.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 99.00; Price excludes VAT (USA)

Softcover Book: USD 129.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Hu, B., Shen, J., Zhu, L., Dong, Q., Cai, H., Qian, K.: Fundamentals of computational psychophysiology: theory and methodology. IEEE Trans. Comput. Soc. Syst. 9(2), 349–355 (2022)
Article Google Scholar
Shen, J., Zhang, X., Hu, B., Wang, G., Ding, Z., Hu, B.: An improved empirical mode decomposition of electroencephalogram signals for depression detection. IEEE Trans. Affect. Comput. 13(1), 262–271 (2022)
Article Google Scholar
Zhang, X., Shen, J., ud Din, Z., Liu, J., Wang, G., Hu, B.: Multimodal depression detection: fusion of electroencephalography and paralinguistic behaviors using a novel strategy for classifier ensemble. IEEE J. Biomed. Health Inform. 23(6), 2265–2275 (2019)
Google Scholar
Banerjee, N., Borah, S., Sethi, N.: Intelligent stuttering speech recognition: a succinct review. Multimed. Tools Appl. 81, 1–22 (2022)
Article Google Scholar
Lickley, R.: Disfluency in typical and stuttered speech. Fattori Sociali E Biologici Nella Variazione Fonetica-Social and Biological Factors in Speech Variation (2017)
Google Scholar
Junuzovic-Zunic, L., Sinanovic, O., Majic, B.: Neurogenic stuttering: etiology, symptomatology, and treatment. Med. Arch. 75(6), 456 (2021)
Article Google Scholar
Catalano, G., Robben, D.L., Catalano, M.C., Kahn, D.A.: Olanzapine for the treatment of acquired neurogenic stuttering. J. Psychiatr. Pract.® 15(6), 484–488 (2009)
Google Scholar
Oue, S., Marxer, R., Rudzicz, F.: Automatic dysfluency detection in dysarthric speech using deep belief networks. In: Proceedings of SLPAT 2015: 6th Workshop on Speech and Language Processing for Assistive Technologies, pp. 60–64 (2015)
Google Scholar
Sheikh, S.A., Sahidullah, M., Hirsch, F., Ouni, S.: StutterNet: stuttering detection using time delay neural network. In: 29th European Signal Processing Conference (EUSIPCO), pp. 426–430 (2021)
Google Scholar
Qian, K., et al.: A bag of wavelet features for snore sound classification. Ann. Biomed. Eng. 47(4), 1000–1011 (2019)
Article Google Scholar
Qian, K., Zhang, Z., Yamamoto, Y., Schuller, B.W.: Artificial intelligence Internet of Things for the elderly: from assisted living to health-care monitoring. IEEE Signal Process. Mag. 38(4), 78–88 (2021)
Article Google Scholar
Qian, K., et al.: Computer audition for healthcare: opportunities and challenges. Front. Digit. Health 2, 5 (2020)
Article Google Scholar
Shen, J., Zhao, S., Yao, Y., Wang, Y., Feng, L.: A novel depression detection method based on pervasive EEG and EEG splitting criterion. In: 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 1879–1886. IEEE (2017)
Google Scholar
Shen, J., et al.: An optimal channel selection for EEG-based depression detection via kernel-target alignment. IEEE J. Biomed. Health Inform. 25(7), 2545–2556 (2020)
Article MathSciNet Google Scholar
Yang, M., Ma, Y., Liu, Z., Cai, H., Hu, X., Hu, B.: Undisturbed mental state assessment in the 5G era: a case study of depression detection based on facial expressions. IEEE Wirel. Commun. 28(3), 46–53 (2021)
Article Google Scholar
Zhang, K., et al.: Research on mine vehicle tracking and detection technology based on YOLOv5. Syst. Sci. Control Eng. 10(1), 347–366 (2022)
Article MathSciNet Google Scholar
Shen, J., et al.: Exploring the intrinsic features of EEG signals via empirical mode decomposition for depression recognition. IEEE Trans. Neural Syst. Rehabil. Eng. 31, 356–365 (2022)
Article Google Scholar
Shen, J., et al.: Depression recognition from EEG signals using an adaptive channel fusion method via improved focal loss. IEEE J. Biomed. Health Inform. 27, 3234–3245 (2023)
Article Google Scholar
Rosenberg, J., et al.: Conflict processing networks: a directional analysis of stimulus-response compatibilities using MEG. PLoS ONE 16(2), e0247408 (2021)
Article MathSciNet Google Scholar
Dong, Q., et al.: Integrating convolutional neural networks and multi-task dictionary learning for cognitive decline prediction with longitudinal images. J. Alzheimer’s Dis. 75(3), 971–992 (2020)
Article Google Scholar
Wu, Y., et al.: Person reidentification by multiscale feature representation learning with random batch feature mask. IEEE Trans. Cogn. Dev. Syst. 13(4), 865–874 (2020)
Article Google Scholar
Demir, F., Sengur, A., Cummins, N., Amiriparian, S., Schuller, B.W.: Low level texture features for snore sound discrimination. In: 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 413–416 (2018)
Google Scholar
Barrett, L., Hu, J., Howell, P.: Systematic review of machine learning approaches for detecting developmental stuttering. IEEE/ACM Trans. Audio Speech Lang. Process. 30, 1160–1172 (2022)
Article Google Scholar
Howell, P., Sackin, S.: Automatic recognition of repetitions and prolongations in stuttered speech. In: Proceedings of the First World Congress on Fluency Disorders, vol. 2, pp. 372–374. University Press Nijmegen Nijmegen, The Netherlands (1995)
Google Scholar
Gupta, S., Shukla, R.S., Shukla, R.K., Verma, R.: Deep learning bidirectional LSTM based detection of prolongation and repetition in stuttered speech using weighted MFCC. Int. J. Adv. Comput. Sci. Appl. 11(9), 1–12 (2020)
Google Scholar
Świetlicka, I., Kuniszyk-Jóźkowiak, W., Smołka, E.: Artificial neural networks in the disabled speech analysis. Comput. Recogn. Syst. 3, 347–354 (2009)
Google Scholar
Ravikumar, K.M., Rajagopal, R., Nagaraj, H.: An approach for objective assessment of stuttered speech using MFCC features. ICGST Int. J. Digit. Signal Process. 9(1), 19–24 (2009)
Google Scholar
Chee, L.S., Ai, O.C., Hariharan, M., Yaacob, S.: MFCC based recognition of repetitions and prolongations in stuttered speech using k-NN and LDA. In: 2009 IEEE Student Conference on Research and Development (SCOReD), pp. 146–149. IEEE (2009)
Google Scholar
Ai, O.C., Hariharan, M., Yaacob, S., Chee, L.S.: Classification of speech dysfluencies with MFCC and LPCC features. Expert Syst. Appl. 39(2), 2157–2165 (2012)
Article Google Scholar
Mahesha, P., Vinod, D.: Support vector machine-based stuttering dysfluency classification using gmm supervectors. Int. J. Grid Util. Comput. 6(3–4), 143–149 (2015)
Article Google Scholar
Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 618–626 (2017)
Google Scholar
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.: MobilenetV2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4510–4520 (2018)
Google Scholar
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016)
Google Scholar
Xu, H., Ma, J., Jiang, J., Guo, X., Ling, H.: U2Fusion: a unified unsupervised image fusion network. IEEE Trans. Pattern Anal. Mach. Intell. 44(1), 502–518 (2020)
Article Google Scholar
Tan, M., Le, Q.: EfficientnetV2: smaller models and faster training. In: International Conference on Machine Learning (ICML), pp. 10096–10106 (2021)
Google Scholar
Ho, J., Kalchbrenner, N., Weissenborn, D., Salimans, T.: Axial attention in multidimensional transformers. arXiv preprint arXiv:1912.12180 (2019)
Bayerl, S.P., von Gudenberg, A.W., Hönig, F., Nöth, E., Riedhammer, K.: KSoF: the Kassel state of fluency dataset–a therapy centered dataset of stuttering. arXiv preprint arXiv:2203.05383 (2022)
Schuller, B.W., et al.: The ACM Multimedia 2022 Computational Paralinguistics Challenge: Vocalisations, Stuttering, Activity, & Mosquitoes, pp. 1–5. arXiv Preprint arXiv:2205.06799 (2022)
McFee, B., et al.: librosa: audio and music signal analysis in Python. In: Proceedings of the 14th Python in Science Conference, vol. 8, pp. 18–25 (2015)
Google Scholar
Hunter, J.D.: Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9(03), 90–95 (2007)
Article Google Scholar

Download references

Acknowledgements

This work partially supported by the National Key Research and Development Program of China (Grant No. 2019YFA0706200), the Project funded by China Postdoctoral Science Foundation (Grant No. 2021M700423), the Ministry of Science and Technology of the People’s Republic of China (No. 2021ZD0201900, 2021ZD0200601), the National Natural Science Foundation of China (No. 62227807, 62272044, 62072219), the National High-Level Young Talent Project, the BIT Teli Young Fellow Program from the Beijing Institute of Technology, China, the Natural Science Foundation of Gansu Province, China (No. 22JR5RA401), the Fundamental Research Funds for the Central Universities (No. lzujbky-2022-ey13), the JSPS KAKENHI (No. 20H00569), the JST Mirai Program (No. 21473074), and the JST MOONSHOT Program (No. JPMJMS229B), Japan.

Author information

Authors and Affiliations

School of Medical Technology, Beijing Institute of Technology, Beijing, China
Yu Ma, Yuting Huang, Kaixiang Yuan, Guangzhe Xuan, Yongzi Yu, Hengrui Zhong, Jian Shen, Kun Qian & Bin Hu
School of Information Science and Engineering, Lanzhou University, Lanzhou, China
Rui Li
GLAM – Group on Language, Audio, and Music, Imperial College London, London, UK
Björn W. Schuller
Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg, Germany
Björn W. Schuller
Educational Physiology Laboratory, Graduate School of Education, University of Tokyo, Tokyo, Japan
Yoshiharu Yamamoto

Authors

Yu Ma
View author publications
You can also search for this author in PubMed Google Scholar
Yuting Huang
View author publications
You can also search for this author in PubMed Google Scholar
Kaixiang Yuan
View author publications
You can also search for this author in PubMed Google Scholar
Guangzhe Xuan
View author publications
You can also search for this author in PubMed Google Scholar
Yongzi Yu
View author publications
You can also search for this author in PubMed Google Scholar
Hengrui Zhong
View author publications
You can also search for this author in PubMed Google Scholar
Rui Li
View author publications
You can also search for this author in PubMed Google Scholar
Jian Shen
View author publications
You can also search for this author in PubMed Google Scholar
Kun Qian
View author publications
You can also search for this author in PubMed Google Scholar
Bin Hu
View author publications
You can also search for this author in PubMed Google Scholar
Björn W. Schuller
View author publications
You can also search for this author in PubMed Google Scholar
Yoshiharu Yamamoto
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to Jian Shen , Kun Qian or Bin Hu .

Editor information

Editors and Affiliations

Department of Computer Science, Eastern Institute of Technology, Zhejiang, China
De-Shuang Huang
University of Wollongong, North Wollongong, NSW, Australia
Prashan Premaratne
Zhengzhou University of Light Industry, Zhengzhou, China
Baohua Jin
Zhong Yuan University of Technology, Zhengzhou, China
Boyang Qu
University of Ulsan, Ulsan, Korea (Republic of)
Kang-Hyun Jo
Department of Computer Science, Liverpool John Moores University, Liverpool, UK
Abir Hussain

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Ma, Y. et al. (2023). Explainable Stuttering Recognition Using Axial Attention. In: Huang, DS., Premaratne, P., Jin, B., Qu, B., Jo, KH., Hussain, A. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2023. Lecture Notes in Computer Science, vol 14088. Springer, Singapore. https://doi.org/10.1007/978-981-99-4749-2_18

Download citation

DOI: https://doi.org/10.1007/978-981-99-4749-2_18
Published: 30 July 2023
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-4748-5
Online ISBN: 978-981-99-4749-2
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics