When speech data with native transcriptions are scarce in an under-resourced language, automatic speech recognition (ASR) must be trained using other methods. Semi-supervised learning, here in the form of self-training, first labels the speech using ASR systems from other languages, then re-trains the ASR on the generated labels. Mismatched crowdsourcing asks crowd-workers unfamiliar with the language to transcribe it. In this paper, self-training and mismatched crowdsourcing are compared under exactly matched conditions. Specifically, speech data of the target language are decoded by source-language ASR systems into source-language phone/word sequences. We find that (1) human mismatched crowdsourcing and cross-lingual ASR have similar error patterns but different specific errors; (2) these two sources of information can be usefully combined to train a better target-language ASR; and (3) the differences between the error patterns of non-native human listeners and non-native ASR are small, but when differences are observed, they provide information about the relationship between the phoneme systems of the annotator/source language (Mandarin) and the target language (Vietnamese).
Cite as: Do, V.H., Chen, N.F., Lim, B.P., Hasegawa-Johnson, M. (2016) Analysis of Mismatched Transcriptions Generated by Humans and Machines for Under-Resourced Languages. Proc. Interspeech 2016, 3863-3867, doi: 10.21437/Interspeech.2016-736
@inproceedings{do16c_interspeech,
  author={Van Hai Do and Nancy F. Chen and Boon Pang Lim and Mark Hasegawa-Johnson},
  title={{Analysis of Mismatched Transcriptions Generated by Humans and Machines for Under-Resourced Languages}},
  year=2016,
  booktitle={Proc. Interspeech 2016},
  pages={3863--3867},
  doi={10.21437/Interspeech.2016-736}
}