ISCA Archive SLTU 2018

Mining Training Data for Language Modeling Across the World's Languages

Manasa Prasad, Theresa Breiner, Daan van Esch

Building smart keyboards and speech recognition systems for new languages requires a large, clean text corpus on which to train n-gram language models. We report our findings on how much text data can realistically be found on the web across thousands of languages. In addition, we describe an innovative, scalable approach to normalizing this data: all data sources are noisy to some extent, but the problem is even more severe for low-resource languages. To clean the data we find across all languages in a scalable way, we built a pipeline that automatically derives the configuration for language-specific text normalization systems, which we also describe here.
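The abstract mentions automatically deriving per-language text normalization configurations. As a rough illustration only (not the authors' actual pipeline), one common scalable approach is to derive a character allowlist from corpus frequency statistics and then filter text against it; the function names `derive_charset` and `normalize` below are hypothetical, and the coverage-threshold heuristic is an assumption:

```python
import unicodedata
from collections import Counter

def derive_charset(corpus_lines, coverage=0.999):
    """Derive a per-language character allowlist: keep the smallest set of
    characters that covers `coverage` of all observed characters, so rare
    noise characters (encoding debris, foreign-script intrusions) fall out.
    This is an illustrative heuristic, not the paper's method."""
    counts = Counter()
    for line in corpus_lines:
        counts.update(unicodedata.normalize("NFC", line))
    total = sum(counts.values())
    allowed, cumulative = set(), 0
    for char, n in counts.most_common():
        allowed.add(char)
        cumulative += n
        if cumulative / total >= coverage:
            break
    return allowed

def normalize(line, allowed):
    """NFC-normalize a line and drop characters outside the allowlist."""
    nfc = unicodedata.normalize("NFC", line)
    return "".join(ch for ch in nfc if ch in allowed)
```

For example, a corpus dominated by one script would yield an allowlist excluding a stray control character, so `normalize` silently strips it from training text before language-model estimation.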


doi: 10.21437/SLTU.2018-13

Cite as: Prasad, M., Breiner, T., van Esch, D. (2018) Mining Training Data for Language Modeling Across the World's Languages. Proc. 6th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2018), 61-65, doi: 10.21437/SLTU.2018-13

@inproceedings{prasad18_sltu,
  author={Manasa Prasad and Theresa Breiner and Daan {van Esch}},
  title={{Mining Training Data for Language Modeling Across the World's Languages}},
  year=2018,
  booktitle={Proc. 6th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2018)},
  pages={61--65},
  doi={10.21437/SLTU.2018-13}
}