Explainable Stuttering Recognition Using Axial Attention

  • Conference paper
Advanced Intelligent Computing Technology and Applications (ICIC 2023)

Abstract

Stuttering is a complex speech disorder that disrupts the flow of speech, and recognizing persons who stutter (PWS) and understanding their significant struggles are crucial. With advancements in computer vision, deep neural networks offer potential for recognizing stuttering events through image-based features. In this paper, we extract Wavelet Transform (WT) and Histogram of Oriented Gradients (HOG) image features from audio signals. We also generate explainable images using Gradient-weighted Class Activation Mapping (Grad-CAM) and use them as input to our final recognition model, an axial attention-based EfficientNetV2 trained on the Kassel State of Fluency (KSoF) dataset to perform 8-class recognition. In our experiments, the model achieved a relative increase in unweighted average recall (UAR) of 4.4% over the ComParE 2022 baseline, demonstrating that the axial attention-based EfficientNetV2, combined with the explainable input, can detect and recognize multiple types of stuttering.
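
The reported gain is a relative increase in UAR, which is the mean of the per-class recalls with every class weighted equally regardless of how many samples it has. The following is a minimal sketch of both quantities. The function names, the two-class toy labels, and the numbers passed in are illustrative assumptions; they are not the authors' evaluation code, KSoF data, or the paper's results.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Return the mean of per-class recalls (each class counts equally)."""
    total = defaultdict(int)    # samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


def relative_increase_percent(new, baseline):
    """Relative (not absolute) improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Hypothetical toy labels, chosen only to show the class-balancing effect.
y_true = ["fluent", "fluent", "fluent", "fluent", "block", "block"]
y_pred = ["fluent", "fluent", "fluent", "block", "block", "fluent"]

uar = unweighted_average_recall(y_true, y_pred)
# Recall is 3/4 for "fluent" and 1/2 for "block", so UAR is 0.625,
# whereas plain accuracy would be 4/6.
print(f"UAR = {uar:.3f}")

# Hypothetical scores: moving from 0.50 to 0.55 is a 10% relative increase.
print(f"relative increase = {relative_increase_percent(0.55, 0.50):.1f}%")
```

Because each class contributes equally, UAR is less sensitive to class imbalance than accuracy, which matters when some classes have far fewer samples than others.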

Acknowledgements

This work was partially supported by the National Key Research and Development Program of China (Grant No. 2019YFA0706200), the China Postdoctoral Science Foundation (Grant No. 2021M700423), the Ministry of Science and Technology of the People’s Republic of China (No. 2021ZD0201900, 2021ZD0200601), the National Natural Science Foundation of China (No. 62227807, 62272044, 62072219), the National High-Level Young Talent Project, the BIT Teli Young Fellow Program from the Beijing Institute of Technology, China, the Natural Science Foundation of Gansu Province, China (No. 22JR5RA401), the Fundamental Research Funds for the Central Universities (No. lzujbky-2022-ey13), the JSPS KAKENHI (No. 20H00569), the JST Mirai Program (No. 21473074), and the JST MOONSHOT Program (No. JPMJMS229B), Japan.

Author information

Corresponding authors

Correspondence to Jian Shen, Kun Qian, or Bin Hu.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Ma, Y. et al. (2023). Explainable Stuttering Recognition Using Axial Attention. In: Huang, DS., Premaratne, P., Jin, B., Qu, B., Jo, KH., Hussain, A. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2023. Lecture Notes in Computer Science, vol 14088. Springer, Singapore. https://doi.org/10.1007/978-981-99-4749-2_18

  • DOI: https://doi.org/10.1007/978-981-99-4749-2_18

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-4748-5

  • Online ISBN: 978-981-99-4749-2

  • eBook Packages: Computer Science, Computer Science (R0)
