Articulation-based pronunciation error detection addresses the task of
diagnosing mispronounced segments in non-native speech
at the level of broad phonological properties, such as place of articulation
or voicing. Using acoustic features and spectrogram images extracted
from native English utterances, we train several neural classifiers
that infer articulatory properties of segments taken from non-native
English utterances. Visual cues are processed by convolutional
neural networks, while acoustic cues are processed by recurrent neural
networks.
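
To make the two-branch setup concrete, the following is a minimal
PyTorch sketch of the kind of unimodal classifiers described above;
the layer sizes, input shapes, and example class labels are
illustrative assumptions, not the architecture used in the paper.

import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    # CNN branch: classifies a spectrogram image of a segment.
    def __init__(self, n_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_classes)

    def forward(self, x):  # x: (batch, 1, freq_bins, time_frames)
        return self.classifier(self.features(x).flatten(1))

class AcousticRNN(nn.Module):
    # RNN branch: classifies a sequence of acoustic feature vectors.
    def __init__(self, n_features, n_classes):
        super().__init__()
        self.lstm = nn.LSTM(n_features, 64, batch_first=True,
                            bidirectional=True)
        self.classifier = nn.Linear(2 * 64, n_classes)

    def forward(self, x):  # x: (batch, time, n_features)
        out, _ = self.lstm(x)
        return self.classifier(out[:, -1])  # last-step summary

# Hypothetical usage: three place-of-articulation classes.
cnn = SpectrogramCNN(n_classes=3)
logits = cnn(torch.randn(8, 1, 64, 128))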
We show that combining both modalities improves performance over
either model used in isolation, with important implications for user satisfaction.
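
The abstract does not specify the fusion strategy; a simple
late-fusion baseline averages the class posteriors of the two
branches, as in the sketch below (the weight w is a hypothetical
parameter, not taken from the paper).

import torch.nn.functional as F

def late_fusion(cnn_logits, rnn_logits, w=0.5):
    # Weighted average of per-branch class posteriors.
    p_cnn = F.softmax(cnn_logits, dim=-1)
    p_rnn = F.softmax(rnn_logits, dim=-1)
    return w * p_cnn + (1.0 - w) * p_rnn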
Furthermore, we test the impact of alignment quality on model performance
by comparing results on manually corrected and force-aligned
segments, showing that the proposed pipeline can dispense with manual
correction.
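
For illustration, a segment-extraction step of this kind could be
sketched as below, cutting phone segments out of an utterance given
time boundaries from a forced aligner or from manual correction; the
(phone, start, end) tuple format and the soundfile dependency are
assumptions for the example, not the paper's actual tooling.

import soundfile as sf  # assumed dependency; any WAV reader works

def cut_segments(wav_path, alignments):
    # alignments: list of (phone, start_sec, end_sec) tuples.
    audio, sr = sf.read(wav_path)
    return [(phone, audio[int(start * sr):int(end * sr)])
            for phone, start, end in alignments]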
Cite as: Jenne, S., Vu, N.T. (2019) Multimodal Articulation-Based Pronunciation Error Detection with Spectrogram and Acoustic Features. Proc. Interspeech 2019, 3549-3553, doi: 10.21437/Interspeech.2019-1677
@inproceedings{jenne19_interspeech,
  author    = {Sabrina Jenne and Ngoc Thang Vu},
  title     = {{Multimodal Articulation-Based Pronunciation Error Detection with Spectrogram and Acoustic Features}},
  year      = {2019},
  booktitle = {Proc. Interspeech 2019},
  pages     = {3549--3553},
  doi       = {10.21437/Interspeech.2019-1677}
}