Abstract
A recent trend in language learning is gamification, i.e., the application of game-design elements and game principles in non-game contexts. A key component of such systems is the detection of mispronunciations by means of automatic speech recognition (ASR). However, constraints such as the need for quiet environments and close-talking microphones limit the applicability of ASR to language learning games.
In this work, we propose to use multi-modal—specifically audio-visual—speech recognition as an alternative for detecting mispronunciations in acoustically noisy or otherwise challenging environments. We examine a hybrid speech recognizer structure, using either feed-forward or bidirectional long short-term memory (BiLSTM) networks. There are several options for integrating the two modalities. Here, we compare early fusion, i.e., the use of one joint audio-visual network, with a turbo-decoding approach that combines the contributions of separate acoustic and visual models. We evaluate the performance of these topologies in detecting common phoneme mispronunciations, namely errors in manner of articulation (MoA) and in place of articulation (PoA). It is shown that our novel architecture, using deep neural network acoustic and visual submodels in conjunction with turbo-decoding, is very well suited to the task of mispronunciation detection, and that the visual modality contributes strongly to achieving noise-robust performance.
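As a minimal sketch of the early-fusion idea described above (all dimensions, feature types, and weights here are hypothetical illustrations, not taken from the paper), early fusion amounts to concatenating frame-synchronous audio and visual feature vectors so that a single joint network sees one audio-visual input per frame:

```python
import numpy as np

# Hypothetical frame-synchronous features: 39-dim audio (e.g. MFCCs with
# deltas) and 32-dim visual (lip-region) features over T frames.
T, D_A, D_V = 100, 39, 32
rng = np.random.default_rng(0)
audio_feats = rng.standard_normal((T, D_A))
visual_feats = rng.standard_normal((T, D_V))

# Early fusion: concatenate the two modalities frame by frame, yielding a
# single 71-dim audio-visual feature vector per frame for one joint network.
av_feats = np.concatenate([audio_feats, visual_feats], axis=1)

# One feed-forward layer over the fused features (dummy random weights),
# followed by a softmax over a hypothetical set of 40 phoneme classes.
W = rng.standard_normal((D_A + D_V, 40))
logits = av_feats @ W
posteriors = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

print(av_feats.shape)    # (100, 71)
print(posteriors.shape)  # (100, 40)
```

The turbo-decoding alternative instead keeps separate acoustic and visual models and iteratively exchanges their posterior information during decoding, rather than fusing at the feature level as shown here.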
This project has received funding from the European Regional Development Fund (ERDF).
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Karbasi, M., Zeiler, S., Freiwald, J., Kolossa, D. (2019). Toward Robust Mispronunciation Detection via Audio-Visual Speech Recognition. In: Rojas, I., Joya, G., Catala, A. (eds) Advances in Computational Intelligence. IWANN 2019. Lecture Notes in Computer Science, vol 11507. Springer, Cham. https://doi.org/10.1007/978-3-030-20518-8_54
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-20517-1
Online ISBN: 978-3-030-20518-8
eBook Packages: Computer Science; Computer Science (R0)