Burmese (Myanmar) Name Romanization: A Sub-syllabic Segmentation Scheme for Statistical Solutions

Ding, Chenchen; Pa, Win Pa; Utiyama, Masao; Sumita, Eiichiro

doi:10.1007/978-981-10-8438-6_16


Part of the book series: Communications in Computer and Information Science (CCIS, volume 781)

Included in the following conference series:

International Conference of the Pacific Association for Computational Linguistics


Abstract

We focus on Burmese name Romanization, a critical task in translating Burmese into languages written in Latin script. Because Burmese is under-researched and poorly resourced, we collected and manually annotated 2,335 Romanization instances to enable statistical approaches. The annotation covers string segmentation and alignment between the Burmese and Latin scripts. Previous studies treat syllables as unbreakable units when processing Burmese; in this study, Burmese strings are instead segmented into well-designed sub-syllabic units to achieve precise and consistent alignment with Latin script. The experiments show that sub-syllabic units work better than syllables for statistical approaches to Burmese name Romanization. The annotated data and the segmentation program have been released under a CC BY-NC-SA license.


Notes

1. http://www.nlpresearch-ucsy.edu.mm/NLP_UCSY/name-db.html.
2. Typical ones are the Myanmar Language Commission Transcription System, the Library of Congress' ALA-LC Romanization index system for Burmese (http://www.loc.gov/catdir/cpso/romanization/burmese.pdf), and Okell's system [13].
3. Yayit originally represents while in the modern standard Burmese the phoneme has been merged into
4. Actually a voiceless sign, e.g., changing to .
5. Yapin can also be combined with .
6. may be argued in some references. The combination appears marginally in borrowed words and interjections.
7. E.g., is actually or .
8. E.g., changing to and changing to .
9. The visarga is usually not transcribed, and aukmyit is inconsistently represented by a final t in Romanization.
10. Multiple medial consonants for one initial consonant are possible, but yapin and yayit cannot appear simultaneously.
11. As mentioned, glottal endings take no tones.
12. However, the swapped order may cause no problem in display, so both orders are used in daily typing.
13. Using GIZA++ [12] at http://www.statmt.org/moses/giza/GIZA++.html.
14. An open-source tool is available at https://github.com/lemaoliu/Agtarbidir.
15. http://taku910.github.io/crfpp/.
16. http://www.phontron.com/kytea/.
17. I.e., on the level in the bottom rank in Fig. 1, with no explicit alignment or unit boundaries between characters.
18. SEG cannot be applied to the RNN approach as the alignment and segmentation are not explicit variables.
19. I.e., the results in Tables 1 and 2 are based on the middle and upper-right parts in Fig. 1, respectively.
20. The Romanization instance is directly taken from the released data set. A more common Romanization of the Pali-derived name is Wunna.

References

1. Banchs, R.E., Zhang, M., Duan, X., Li, H., Kumaran, A.: Report of NEWS 2015 machine transliteration shared task. In: Proceedings of NEWS, pp. 10–23 (2015)
2. Costa-Jussà, M.R.: Moses-based official baseline for NEWS 2016. In: Proceedings of NEWS, pp. 88–90 (2016)
3. Ding, C., Thu, Y.K., Utiyama, M., Finch, A., Sumita, E.: Empirical dependency-based head finalization for statistical Chinese-, English-, and French-to-Myanmar (Burmese) machine translation. In: Proceedings of IWSLT, pp. 184–191 (2014)
4. Ding, C., Thu, Y.K., Utiyama, M., Sumita, E.: Parsing Myanmar (Burmese) by using Japanese as a pivot. In: Proceedings of ICCA (Myanmar), pp. 158–162 (2016)
5. Ding, C., Thu, Y.K., Utiyama, M., Sumita, E.: Word segmentation for Burmese (Myanmar). ACM Trans. Asian Low Resour. Lang. Inf. Process. 15(4), 22 (2016)
6. Finch, A., Liu, L., Wang, X., Sumita, E.: Neural network transduction models in transliteration generation. In: Proceedings of NEWS, pp. 61–66 (2015)
7. Finch, A., Liu, L., Wang, X., Sumita, E.: Target-bidirectional neural models for machine transliteration. In: Proceedings of NEWS, pp. 78–82 (2016)
8. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of ICML, pp. 282–289 (2001)
9. Liu, L., Finch, A., Utiyama, M., Sumita, E.: Agreement on target-bidirectional LSTMs for sequence-to-sequence learning. In: Proceedings of AAAI, pp. 2630–2637 (2016)
10. Naing, H.M.S., Hlaing, A.M., Pa, W.P., Hu, X., Thu, Y.K., Hori, C., Kawai, H.: A Myanmar large vocabulary continuous speech recognition system. In: Proceedings of APSIPA, pp. 320–327 (2015)
11. Neubig, G., Nakata, Y., Mori, S.: Pointwise prediction for robust, adaptable Japanese morphological analysis. In: Proceedings of ACL-HLT, pp. 529–533 (2011)
12. Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Comput. Linguist. 29(1), 19–51 (2003)
13. Okell, J.: A guide to the Romanization of Burmese (1971)
14. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of ACL, pp. 311–318 (2002)
15. Sha, F., Pereira, F.: Shallow parsing with conditional random fields. In: Proceedings of HLT-NAACL, pp. 134–141 (2003)
16. Thu, Y.K., Pa, W.P., Finch, A., Ni, J., Sumita, E., Hori, C.: The application of phrase based statistical machine translation techniques to Myanmar grapheme to phoneme conversion. In: Hasida, K., Purwarianti, A. (eds.) Computational Linguistics. CCIS, vol. 593, pp. 238–250. Springer, Singapore (2016). https://doi.org/10.1007/978-981-10-0515-2_17
17. Thu, Y.K., Pa, W.P., Ni, J., Shiga, Y., Finch, A., Hori, C., Kawai, H., Sumita, E.: HMM based Myanmar text to speech system. In: Proceedings of INTERSPEECH, pp. 2237–2241 (2015)


Author information

Authors and Affiliations

Advanced Translation Technology Laboratory, ASTREC, National Institute of Information and Communications Technology, 3-5 Hikaridai, Seikacho, Sorakugun, Kyoto, 619-0289, Japan
Chenchen Ding, Masao Utiyama & Eiichiro Sumita
Natural Language Processing Lab, University of Computer Studies, Yangon, Myanmar
Win Pa Pa


Corresponding author

Correspondence to Chenchen Ding.

Editor information

Editors and Affiliations

Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan
Kôiti Hasida
Natural Language Processing Lab, University of Computer Studies, Yangon, Yangon, Myanmar
Win Pa Pa

Appendix

Figure 6 shows specific annotation instances for further illustration. The data are organized in a three-section format consisting of

original Burmese name,
original Romanization, and
aligned Burmese/Latin graphemes,

separated by |||.
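
The three-section format described above can be read with a few lines of code. The following is a minimal sketch, not the released segmentation program: the names Instance and parse_instance are hypothetical, and it assumes that the third section consists of whitespace-separated tokens of the form Burmese-unit/Latin-unit, which the text here does not state explicitly. The sample line uses ASCII stand-ins (b1, b2, b3) rather than real data.

```python
from typing import List, NamedTuple, Tuple


class Instance(NamedTuple):
    burmese: str                  # section 1: original Burmese name
    roman: str                    # section 2: original Romanization
    pairs: List[Tuple[str, str]]  # section 3: (Burmese unit, Latin unit) pairs


def parse_instance(line: str) -> Instance:
    """Split one annotation line into its three '|||'-separated sections."""
    sections = [part.strip() for part in line.split("|||")]
    if len(sections) != 3:
        raise ValueError(f"expected 3 sections, got {len(sections)}")
    burmese, roman, aligned = sections
    pairs = []
    for token in aligned.split():
        # Assumption: each aligned token looks like 'burmese_unit/latin_unit'.
        src, sep, tgt = token.rpartition("/")
        if not sep:
            raise ValueError(f"aligned token without '/': {token!r}")
        pairs.append((src, tgt))
    return Instance(burmese, roman, pairs)


# Made-up line: b1..b3 are ASCII stand-ins for Burmese units, not real data.
example = parse_instance("b1b2b3 ||| Kala ||| b1/K b2/a b3/la")
print(example.roman)
print(example.pairs)
```

Splitting each token on its last slash keeps the Latin side intact even if a Burmese unit were ever to contain a slash; if the released files use a different token layout, only the loop body needs to change.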

The descriptions of specific instances are as follows.

I. An ordinary Romanization instance.
II. A Burmese name with a Western expression (Grace) as a component. Generally, such Western expressions are segmented according to their Burmese spellings. In this instance, Grace is segmented into /G /@ /r /a /@ /ce. Note that the same symbol @ is used both for the dummy vowel on the Burmese side and for the silent placeholder on the Latin side, which causes no confusion.
III. A Burmese name derived from Pali (Wanna; see Note 20), in which stacked consonants appear (/n /n). The stacked consonants are split and aligned to separate Latin letters. If the Latin letter is not doubled, the second Burmese character is simply aligned to the silent placeholder @. The stacking operator is always aligned to @.
IV. A Burmese name with complex stacking, in which the rhyme of the previous syllable (/in) is stacked with the following onset (/gy).
V. A Burmese name with more complex stacking, in which part of the rhyme of the previous syllable (/ein) is stacked with the following onset (/g), which takes a further vowel diacritic (/i). Instances IV and V illustrate the necessity of segmenting stacked characters.
VI. A Burmese name with stacked consonants, for which two syllables are kept as one word (Thinzar) in the Romanization.
VII. A Burmese name with stacked consonants, for which two syllables are separated into two words (Thin Zar) in the Romanization. Note that the Burmese names in instances VI and VII are identical. They are treated as two different Romanization instances because their Romanized spellings differ.
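
The placeholder rule for stacked consonants in instance III can be illustrated with a small sketch. It assumes aligned pairs are held as (Burmese unit, Latin unit) tuples; the function name latin_from_pairs, the stand-in unit names (C1, V1, STACK, and so on), and the undoubled spelling Wana are all hypothetical and serve only to show how @ drops out when the Latin string is rebuilt.

```python
from typing import List, Tuple

PLACEHOLDER = "@"  # silent placeholder on the Latin side

Pair = Tuple[str, str]  # (Burmese unit, Latin unit)


def latin_from_pairs(pairs: List[Pair]) -> str:
    """Concatenate the Latin sides of aligned pairs, skipping placeholders."""
    return "".join(latin for _, latin in pairs if latin != PLACEHOLDER)


# Hypothetical pairs; C*, V*, and STACK are stand-ins for Burmese units.
# The stacking operator is aligned to @, and each stacked consonant
# gets its own Latin letter.
doubled = [("C1", "W"), ("V1", "a"), ("C2", "n"),
           ("STACK", "@"), ("C3", "n"), ("V2", "a")]

# Same Burmese units, but the Latin letter is not doubled, so the
# second stacked consonant is aligned to @ as well.
single = [("C1", "W"), ("V1", "a"), ("C2", "n"),
          ("STACK", "@"), ("C3", "@"), ("V2", "a")]

print(latin_from_pairs(doubled))  # Wanna
print(latin_from_pairs(single))   # Wana
```

Because both alignments cover the same six Burmese units, the Burmese side stays segmented identically regardless of how the Latin spelling treats the doubled letter.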


About this paper

Cite this paper

Ding, C., Pa, W.P., Utiyama, M., Sumita, E. (2018). Burmese (Myanmar) Name Romanization: A Sub-syllabic Segmentation Scheme for Statistical Solutions. In: Hasida, K., Pa, W. (eds) Computational Linguistics. PACLING 2017. Communications in Computer and Information Science, vol 781. Springer, Singapore. https://doi.org/10.1007/978-981-10-8438-6_16


DOI: https://doi.org/10.1007/978-981-10-8438-6_16
Published: 04 March 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-8437-9
Online ISBN: 978-981-10-8438-6
eBook Packages: Computer Science, Computer Science (R0)
