Abstract
This paper examines the effectiveness of conditional random fields (CRFs) for identifying Myanmar word boundaries within a supervised framework. Existing approaches rely on maximum matching, which appears to suffer from problems arising from the way Myanmar words are composed. In our experiments, the CRF approach is compared against a maximum matching baseline whose dictionaries are drawn from the Myanmar Language Commission Dictionary (words only) and from a manually segmented subset of the BTEC1 corpus. The experimental results show that the CRF model achieves considerably higher F-scores on the segmentation task than the baseline, even when the baseline is allowed to include words from the test data in its dictionary.
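To make the baseline concrete, the following is a minimal sketch of greedy longest-first (maximum) matching over syllable units, together with a word-level F-score. It is an illustration under stated assumptions, not the paper's implementation: the function names `max_match`, `word_spans`, and `f_score`, the `max_len` parameter, the Latin placeholder syllables, and the choice of scoring by exact word spans are all hypothetical and are not taken from the paper.

```python
from typing import List, Set, Tuple


def max_match(syllables: List[str], dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Greedy left-to-right longest matching over syllable units.

    Assumption: input is already split into syllables, and an unknown
    single syllable is emitted as a word of its own.
    """
    words: List[str] = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first and shrink until the dictionary matches.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + n])
            if n == 1 or candidate in dictionary:
                words.append(candidate)
                i += n
                break
    return words


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a segmentation into a set of (start, end) character spans."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def f_score(predicted: List[str], reference: List[str]) -> float:
    """Harmonic mean of precision and recall over exactly matching word spans."""
    pred, ref = word_spans(predicted), word_spans(reference)
    correct = len(pred & ref)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(ref)
    return 2 * precision * recall / (precision + recall)


# Toy example with placeholder syllables (not real Myanmar text).
toy_dictionary = {"ab", "abc", "cd"}
reference = ["ab", "cd"]
predicted = max_match(["a", "b", "c", "d"], toy_dictionary)
print(predicted, f_score(predicted, reference))
```

In this toy case the greedy matcher commits to the longer entry `abc` and leaves `d` stranded, even though `ab` + `cd` is the reference segmentation and both words are in the dictionary. This is the kind of error a greedy dictionary lookup cannot recover from, and it shows why adding test-set words to the dictionary does not by itself guarantee a correct segmentation.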
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Pa, W.P., Thu, Y.K., Finch, A., Sumita, E. (2016). Word Boundary Identification for Myanmar Text Using Conditional Random Fields. In: Zin, T., Lin, JW., Pan, JS., Tin, P., Yokota, M. (eds) Genetic and Evolutionary Computing. GEC 2015. Advances in Intelligent Systems and Computing, vol 388. Springer, Cham. https://doi.org/10.1007/978-3-319-23207-2_46
DOI: https://doi.org/10.1007/978-3-319-23207-2_46
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23206-5
Online ISBN: 978-3-319-23207-2
eBook Packages: Engineering, Engineering (R0)