Text normalization is the process of converting non-standard words (NSWs), such as numbers and abbreviations, into standard words so that their pronunciations can be derived by typical means (usually lexicon lookups). Text normalization is thus an important component of any text-to-speech (TTS) system. Without text normalization, the resulting voice may sound unintelligent. In this paper, we describe an approach to developing rule-based text normalization. We also describe our open-source repository containing text normalization grammars and tests for Bangla, Javanese, Khmer, Nepali, Sinhala and Sundanese. Finally, we present a recipe for utilizing the grammars in a TTS system.
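To make the idea of rule-based normalization concrete, the following is a minimal sketch in Python. It is not taken from the paper or its repository: it uses English purely for illustration, and the names `ABBREVIATIONS`, `verbalize_number`, and `normalize` as well as the tiny rule set are hypothetical. It only shows the general pattern of mapping NSW tokens (here, small integers and a few abbreviations) to standard words that a lexicon could then pronounce.

```python
import re

# Hypothetical toy rules for English; the paper's grammars cover other
# languages and are not reproduced here.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
ABBREVIATIONS = {"Dr.": "doctor", "km": "kilometers"}


def verbalize_number(n: int) -> str:
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])
    hundreds, rest = divmod(n, 100)
    head = ONES[hundreds] + " hundred"
    return head if rest == 0 else head + " " + verbalize_number(rest)


def normalize(text: str) -> str:
    """Replace known NSW tokens with standard words; leave the rest as-is."""
    out = []
    for token in text.split():
        if token in ABBREVIATIONS:
            # Abbreviation rule: expand via a lookup table.
            out.append(ABBREVIATIONS[token])
        elif re.fullmatch(r"[0-9]{1,3}", token):
            # Number rule: verbalize integers of up to three digits.
            out.append(verbalize_number(int(token)))
        else:
            out.append(token)
    return " ".join(out)


print(normalize("Dr. Perera ran 42 km"))
```

Tokens that match no rule, such as longer digit strings, pass through unchanged in this sketch; a real system would need rules for those cases as well as language-specific handling of scripts and number systems.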
Cite as: Sodimana, K., Silva, P.D., Sproat, R., Wattanavekin, T., Gutkin, A., Pipatsrisawat, K. (2018) Text Normalization for Bangla, Khmer, Nepali, Javanese, Sinhala and Sundanese Text-to-Speech Systems. Proc. 6th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2018), 147-151, doi: 10.21437/SLTU.2018-31
@inproceedings{sodimana18b_sltu,
  author={Keshan Sodimana and Pasindu De Silva and Richard Sproat and Theeraphol Wattanavekin and Alexander Gutkin and Knot Pipatsrisawat},
  title={{Text Normalization for Bangla, Khmer, Nepali, Javanese, Sinhala and Sundanese Text-to-Speech Systems}},
  year=2018,
  booktitle={Proc. 6th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2018)},
  pages={147--151},
  doi={10.21437/SLTU.2018-31}
}