short-paper

Word Segmentation for Burmese Based on Dual-Layer CRFs

Published: 12 November 2018

Abstract

Burmese is an isolating language in which the syllable is the smallest unit. Segmentation methods based on syllable matching perform only as well as the underlying syllable segmentation. This article proposes a word segmentation method based on dual-layer conditional random fields (CRFs) that fuses syllable features. It combines syllable segmentation and word segmentation into one process, thus reducing the impact of syllable segmentation errors on Burmese word segmentation. In the first layer of the CRF model, Burmese characters serve as atomic features and are combined with features derived from the Backus normal form (BNF) description of Burmese syllables to produce the syllable sequence tags. In the second layer of the CRF model, the tagged syllables are taken as input, and a feature template with syllables as atomic features is built to produce the word sequence tags. The experimental results show that the proposed method performs better than the method based on syllable matching.
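The two-layer flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the coarse character classes in `char_class` are an assumed simplification based on Unicode ranges rather than the paper's BNF features, the three-position feature window is hypothetical, and hand-written B/I tags stand in for the output of trained CRF models.

```python
# Sketch of a two-layer tagging pipeline. A trained CRF would predict the
# B/I tags in each layer; here hand-written tags stand in for model output.

def char_class(ch):
    """Coarse character class from Unicode ranges (an assumed simplification)."""
    cp = ord(ch)
    if 0x1000 <= cp <= 0x1021:
        return "C"  # consonant letter
    if 0x103B <= cp <= 0x103E:
        return "M"  # medial consonant sign
    if 0x102B <= cp <= 0x1032:
        return "V"  # dependent vowel sign
    if cp == 0x103A:
        return "K"  # asat (vowel killer)
    if 0x1036 <= cp <= 0x1038:
        return "T"  # anusvara, dot below, visarga
    return "O"      # anything else


def window_features(units, i, with_class=False):
    """Features for position i: the unit itself and its left/right neighbours."""
    feats = {}
    for off in (-1, 0, 1):
        j = i + off
        unit = units[j] if 0 <= j < len(units) else "<PAD>"
        feats[f"u[{off}]"] = unit
        if with_class:
            feats[f"cls[{off}]"] = char_class(unit) if unit != "<PAD>" else "<PAD>"
    return feats


def merge_by_tags(units, tags):
    """Join units into groups: 'B' starts a new group, 'I' extends the last one."""
    groups = []
    for unit, tag in zip(units, tags):
        if tag == "B" or not groups:
            groups.append(unit)
        else:
            groups[-1] += unit
    return groups


text = "\u1019\u103c\u1014\u103a\u1019\u102c"  # the word "Myanmar", six code points
chars = list(text)

# Layer 1: characters are the atomic units, with class features attached.
layer1_inputs = [window_features(chars, i, with_class=True) for i in range(len(chars))]
layer1_tags = ["B", "I", "I", "I", "B", "I"]  # stand-in for layer-1 CRF output
syllables = merge_by_tags(chars, layer1_tags)

# Layer 2: the syllables produced by layer 1 are the atomic units.
layer2_inputs = [window_features(syllables, i) for i in range(len(syllables))]
layer2_tags = ["B", "I"]  # stand-in for layer-2 CRF output
words = merge_by_tags(syllables, layer2_tags)
```

In this sketch the six characters are grouped into two syllables, which the second layer then joins into a single word. In a real system the feature dictionaries in `layer1_inputs` and `layer2_inputs` would be passed to a CRF toolkit for training and decoding.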


Cited By

  • (2024) Enhancing Sindhi Word Segmentation Using Subword Representation Learning and Position-Aware Self-Attention. IEEE Access 12 (183133-183142). DOI: 10.1109/ACCESS.2024.3507382. Online publication date: 2024
  • (2021) A Neural Joint Model with BERT for Burmese Syllable Segmentation, Word Segmentation, and POS Tagging. ACM Transactions on Asian and Low-Resource Language Information Processing 20:4 (1-23). DOI: 10.1145/3436818. Online publication date: 26-May-2021

    Information

    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 18, Issue 1
    March 2019
    196 pages
    ISSN:2375-4699
    EISSN:2375-4702
    DOI:10.1145/3292011

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 12 November 2018
    Accepted: 01 June 2018
    Revised: 01 March 2018
    Received: 01 October 2017
    Published in TALLIP Volume 18, Issue 1

    Author Tags

    1. BNF
    2. Burmese
    3. CRFs
    4. syllable segmentation
    5. word segmentation

    Qualifiers

    • Short-paper
    • Research
    • Refereed

    Funding Sources

    • Yunnan Province Natural Science Foundation
    • National Natural Science Foundation of China
    • Kunming University of Technology Introduction of Talent Research Start-up Foundation
    • Yunnan Province Department of Education Foundation
