ISCA Archive Interspeech 2016

Analysis of Mismatched Transcriptions Generated by Humans and Machines for Under-Resourced Languages

Van Hai Do, Nancy F. Chen, Boon Pang Lim, Mark Hasegawa-Johnson

When speech data with native transcriptions are scarce in an under-resourced language, automatic speech recognition (ASR) must be trained using other methods. Semi-supervised learning first labels the speech using ASR systems from other languages, then re-trains the ASR using the generated labels. Mismatched crowdsourcing asks crowd workers unfamiliar with the language to transcribe it. In this paper, self-training and mismatched crowdsourcing are compared under exactly matched conditions. Specifically, speech data of the target language are decoded by source-language ASR systems into source-language phone/word sequences. We find that (1) human mismatched crowdsourcing and cross-lingual ASR have similar error patterns, but different specific errors; (2) these two sources of information can be usefully combined to train a better target-language ASR; (3) the differences between the error patterns of non-native human listeners and non-native ASR are small, but when differences are observed, they provide information about the relationship between the phoneme systems of the annotator/source language (Mandarin) and the target language (Vietnamese).
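
To make finding (2) concrete, below is a minimal Python sketch, not the authors' implementation, of how labels from the two mismatched channels (cross-lingual ASR and human mismatched crowdsourcing) could be fused under a noisy-channel model. All phone symbols, confusion probabilities, and priors are illustrative placeholders, not values from the paper.

# Minimal sketch, not the authors' implementation: fusing a cross-lingual
# ASR phone label and a mismatched-crowdsourcing label for the same speech
# segment under a noisy-channel model. All phones and probabilities are
# illustrative placeholders, not values from the paper.

# Hypothetical target-language (Vietnamese) phone prior P(t).
prior = {"t": 0.4, "th": 0.3, "d": 0.3}

# Hypothetical confusion models P(observed source phone | target phone),
# one per channel. Per finding (1), the two channels show similar error
# patterns overall but differ in their specific confusions.
p_asr = {
    "t":  {"t": 0.7, "th": 0.2, "d": 0.1},
    "th": {"t": 0.3, "th": 0.6, "d": 0.1},
    "d":  {"t": 0.2, "th": 0.1, "d": 0.7},
}
p_human = {
    "t":  {"t": 0.6, "th": 0.3, "d": 0.1},
    "th": {"t": 0.2, "th": 0.7, "d": 0.1},
    "d":  {"t": 0.1, "th": 0.2, "d": 0.7},
}

def map_decode(obs_asr, obs_human):
    """MAP estimate of the target phone given both noisy observations,
    assuming the channels are conditionally independent given the target:
    P(t | a, h) is proportional to P(a | t) * P(h | t) * P(t)."""
    scores = {t: p_asr[t].get(obs_asr, 1e-9)
                 * p_human[t].get(obs_human, 1e-9)
                 * p_t
              for t, p_t in prior.items()}
    return max(scores, key=scores.get)

# Example: the ASR outputs "t" while the crowd worker writes "th"; with
# these toy numbers the combined posterior favors /t/ (0.084 vs. 0.063).
print(map_decode("t", "th"))

A real system would decode whole phone sequences with a probabilistic transducer that permits insertions and deletions rather than fusing isolated, pre-aligned labels, but the per-symbol combination above captures the core idea of weighting both noisy channels against a target-language prior.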


doi: 10.21437/Interspeech.2016-736

Cite as: Do, V.H., Chen, N.F., Lim, B.P., Hasegawa-Johnson, M. (2016) Analysis of Mismatched Transcriptions Generated by Humans and Machines for Under-Resourced Languages. Proc. Interspeech 2016, 3863-3867, doi: 10.21437/Interspeech.2016-736

@inproceedings{do16c_interspeech,
  author={Van Hai Do and Nancy F. Chen and Boon Pang Lim and Mark Hasegawa-Johnson},
  title={{Analysis of Mismatched Transcriptions Generated by Humans and Machines for Under-Resourced Languages}},
  year={2016},
  booktitle={Proc. Interspeech 2016},
  pages={3863--3867},
  doi={10.21437/Interspeech.2016-736}
}