Text Normalization for Bangla, Khmer, Nepali, Javanese, Sinhala and Sundanese Text-to-Speech Systems

Sodimana, Keshan; Silva, Pasindu De; Sproat, Richard; Wattanavekin, Theeraphol; Gutkin, Alexander; Pipatsrisawat, Knot

doi:10.21437/SLTU.2018-31

Text normalization is the process of converting non-standard words (NSWs), such as numbers and abbreviations, into standard words so that their pronunciations can be derived by typical means (usually lexicon lookups). Text normalization is thus an important component of any text-to-speech (TTS) system. Without text normalization, the resulting voice may sound unintelligent. In this paper, we describe an approach to developing rule-based text normalization. We also describe our open-source repository containing text normalization grammars and tests for Bangla, Javanese, Khmer, Nepali, Sinhala and Sundanese. Finally, we present a recipe for utilizing the grammars in a TTS system.
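To make the idea of rule-based normalization concrete, here is a minimal sketch in Python. It is not the paper's grammars or tooling. The word lists, the abbreviation table, and the function names `verbalize_number` and `normalize` are hypothetical, and English stands in for the six languages covered. It only shows how numbers and abbreviations might be rewritten as standard words before lexicon lookup.

```python
import re

# Hypothetical toy data: English stand-ins, not taken from the paper's grammars.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
ABBREVIATIONS = {"Dr.": "doctor", "km": "kilometers"}


def verbalize_number(n: int) -> str:
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])
    hundreds, rest = divmod(n, 100)
    head = ONES[hundreds] + " hundred"
    return head if rest == 0 else head + " " + verbalize_number(rest)


def normalize(text: str) -> str:
    """Replace each non-standard token with standard words."""
    out = []
    for token in text.split():
        if token in ABBREVIATIONS:
            out.append(ABBREVIATIONS[token])
        elif re.fullmatch(r"\d{1,3}", token):
            out.append(verbalize_number(int(token)))
        else:
            out.append(token)  # already a standard word, or out of scope here
    return " ".join(out)


print(normalize("Dr. Perera drove 125 km"))
```

A real system would need language-specific number grammars, context-sensitive rules (for example, dates versus plain cardinals), and tests per language. This sketch covers only whitespace-separated tokens and integers below 1000.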

Cite as: Sodimana, K., Silva, P.D., Sproat, R., Wattanavekin, T., Gutkin, A., Pipatsrisawat, K. (2018) Text Normalization for Bangla, Khmer, Nepali, Javanese, Sinhala and Sundanese Text-to-Speech Systems. Proc. 6th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2018), 147-151, doi: 10.21437/SLTU.2018-31

@inproceedings{sodimana18b_sltu,
  author={Keshan Sodimana and Pasindu De Silva and Richard Sproat and Theeraphol Wattanavekin and Alexander Gutkin and Knot Pipatsrisawat},
  title={{Text Normalization for Bangla, Khmer, Nepali, Javanese, Sinhala and Sundanese Text-to-Speech Systems}},
  year=2018,
  booktitle={Proc. 6th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2018)},
  pages={147--151},
  doi={10.21437/SLTU.2018-31}
}