Abstract
A recent trend in language learning is gamification, i.e., the application of game-design elements and game principles in non-game contexts. A key component of such systems is the detection of mispronunciations by means of automatic speech recognition (ASR). However, constraints such as the need for quiet environments and close-talking microphones limit the applicability of ASR to language learning games.
In this work, we propose to use multi-modal—specifically audio-visual—speech recognition as an alternative for detecting mispronunciations in acoustically noisy or otherwise challenging environments. We examine a hybrid speech recognizer structure, using either feed-forward or bidirectional long short-term memory (BiLSTM) networks. There are several options for integrating the two modalities. Here, we compare early fusion, i.e., the use of one joint audio-visual network, with a turbo-decoding approach that combines the contributions of separate acoustic and visual models. We evaluate the performance of these topologies in detecting common phoneme mispronunciations, namely errors in manner of articulation (MoA) and in place of articulation (PoA). It is shown that our novel architecture, using deep neural network acoustic and visual submodels in conjunction with turbo-decoding, is very well suited to the task of mispronunciation detection, and that the visual modality contributes strongly to achieving noise-robust performance.
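As a minimal sketch of the early-fusion idea described above (all dimensions, feature types, and weights here are hypothetical illustrations, not taken from the paper), early fusion amounts to concatenating frame-synchronous audio and visual feature vectors so that a single joint network sees one audio-visual input per frame:

```python
import numpy as np

# Hypothetical frame-synchronous features: 39-dim audio (e.g. MFCCs with
# deltas) and 32-dim visual (lip-region) features over T frames.
T, D_A, D_V = 100, 39, 32
rng = np.random.default_rng(0)
audio_feats = rng.standard_normal((T, D_A))
visual_feats = rng.standard_normal((T, D_V))

# Early fusion: concatenate the two modalities frame by frame, yielding a
# single 71-dim audio-visual feature vector per frame for one joint network.
av_feats = np.concatenate([audio_feats, visual_feats], axis=1)

# One feed-forward layer over the fused features (dummy random weights),
# followed by a softmax over a hypothetical set of 40 phoneme classes.
W = rng.standard_normal((D_A + D_V, 40))
logits = av_feats @ W
posteriors = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

print(av_feats.shape)    # (100, 71)
print(posteriors.shape)  # (100, 40)
```

The turbo-decoding alternative instead keeps separate acoustic and visual models and iteratively exchanges their posterior information during decoding, rather than fusing at the feature level as shown here.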
This project has received funding from the European Regional Development Fund (ERDF).
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Karbasi, M., Zeiler, S., Freiwald, J., Kolossa, D. (2019). Toward Robust Mispronunciation Detection via Audio-Visual Speech Recognition. In: Rojas, I., Joya, G., Catala, A. (eds) Advances in Computational Intelligence. IWANN 2019. Lecture Notes in Computer Science, vol 11507. Springer, Cham. https://doi.org/10.1007/978-3-030-20518-8_54
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-20517-1
Online ISBN: 978-3-030-20518-8
eBook Packages: Computer Science; Computer Science (R0)