Improving ASR Output for Endangered Language Documentation

Jimerson, Robbie; Simha, Kruthika; Ptucha, Raymond; Prudhommeaux, Emily

doi:10.21437/SLTU.2018-39

Documenting endangered languages supports the historical preservation of diverse cultures. Automatic speech recognition (ASR), while potentially very useful for this task, has been underutilized for language documentation due to the challenges inherent in building robust models from extremely limited audio and text training resources. In this paper, we explore the utility of supplementing existing training resources using synthetic data, with a focus on Seneca, a morphologically complex endangered language of North America. We use transfer learning to train acoustic models using both the small amount of available acoustic training data and artificially distorted copies of that data. We then supplement the language model training data with verb forms generated by rule and sentences produced by an LSTM trained on the available text data. The addition of synthetic data yields reductions in word error rate, demonstrating the promise of data augmentation for this task.

Cite as: Jimerson, R., Simha, K., Ptucha, R., Prudhommeaux, E. (2018) Improving ASR Output for Endangered Language Documentation. Proc. 6th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2018), 187-191, doi: 10.21437/SLTU.2018-39

@inproceedings{jimerson18_sltu,
  author={Robbie Jimerson and Kruthika Simha and Raymond Ptucha and Emily Prudhommeaux},
  title={{Improving ASR Output for Endangered Language Documentation}},
  year=2018,
  booktitle={Proc. 6th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2018)},
  pages={187--191},
  doi={10.21437/SLTU.2018-39}
}