short-paper

Word Segmentation for Burmese Based on Dual-Layer CRFs

Authors:

Shaoning Zhang,

Jiafu Zhang

ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), Volume 18, Issue 1

Article No.: 6, Pages 1 - 11

https://doi.org/10.1145/3232537

Published: 12 November 2018

Abstract

Burmese is an isolating language in which the syllable is the smallest unit. Word segmentation methods based on syllable matching therefore depend heavily on the quality of the preceding syllable segmentation. This article proposes a word segmentation method based on dual-layer conditional random fields (CRFs) that fuses syllable features. It combines syllable segmentation and word segmentation into one process, thus reducing the impact of syllable segmentation errors on Burmese word segmentation. In the first layer of the CRF model, Burmese characters serve as atomic features and are combined with features of Burmese syllable structure expressed in Backus normal form (BNF) to tag the Burmese syllable sequence. In the second layer, the tagged syllables are the input, and word boundary tags are produced using a feature template built with syllables as atomic features. The experimental results show that the proposed method outperforms the method based on syllable matching.
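The two-layer flow described in the abstract can be illustrated with a minimal sketch. It shows only the data flow (characters to syllables, syllables to words) and a simple context-window feature function; it does not train or run a CRF. The B/I/E/S tag set, the function names `decode_segments` and `window_features`, the window size, and the romanized toy input are all assumptions made for illustration and are not taken from the paper.

```python
from typing import Dict, List, Sequence

VALID_TAGS = {"B", "I", "E", "S"}  # assumed tag set: Begin, Inside, End, Single


def decode_segments(units: Sequence[str], tags: Sequence[str]) -> List[str]:
    """Merge units into segments according to B/I/E/S boundary tags."""
    if len(units) != len(tags):
        raise ValueError("units and tags must have the same length")
    segments: List[str] = []
    current = ""
    for unit, tag in zip(units, tags):
        if tag not in VALID_TAGS:
            raise ValueError(f"unknown tag: {tag!r}")
        # A new segment starts at B or S; flush anything left unfinished.
        if tag in ("B", "S") and current:
            segments.append(current)
            current = ""
        current += unit
        # A segment ends at E or S.
        if tag in ("E", "S"):
            segments.append(current)
            current = ""
    if current:  # tolerate a tag sequence that does not end with E or S
        segments.append(current)
    return segments


def window_features(units: Sequence[str], i: int, size: int = 1) -> Dict[str, str]:
    """Context-window features around position i, with the unit as the atomic feature."""
    feats: Dict[str, str] = {}
    for offset in range(-size, size + 1):
        j = i + offset
        feats[f"u[{offset:+d}]"] = units[j] if 0 <= j < len(units) else "<PAD>"
    return feats


# Layer 1 (hypothetical output of a character-level tagger): characters -> syllables.
chars = list("mranmacaka")          # romanized toy string, not real corpus data
char_tags = list("BIIEBEBEBE")      # hand-written tags standing in for CRF output
syllables = decode_segments(chars, char_tags)

# Layer 2 (hypothetical output of a syllable-level tagger): syllables -> words.
syll_tags = ["B", "E", "B", "E"]    # hand-written tags standing in for CRF output
words = decode_segments(syllables, syll_tags)

print(syllables)
print(words)
print(window_features(syllables, 0))
```

In a real system the hand-written tag lists would be replaced by the predictions of two trained sequence labelers, with the first layer's output feeding the second layer as in the sketch.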

Cited By

Ali W, Kumar J, Tumrani S, Nour R, Noor A, Xu Z. (2024). Enhancing Sindhi Word Segmentation Using Subword Representation Learning and Position-Aware Self-Attention. IEEE Access 12, 183133-183142. Online publication date: 2024. https://doi.org/10.1109/ACCESS.2024.3507382

Mao C, Man Z, Yu Z, Gao S, Wang Z, Wang H. (2021). A Neural Joint Model with BERT for Burmese Syllable Segmentation, Word Segmentation, and POS Tagging. ACM Transactions on Asian and Low-Resource Language Information Processing 20:4, 1-23. Online publication date: 26-May-2021. https://dl.acm.org/doi/10.1145/3436818

Index Terms

Word Segmentation for Burmese Based on Dual-Layer CRFs
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Phonology / morphology


Information & Contributors

Information

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing Volume 18, Issue 1

March 2019

196 pages

ISSN:2375-4699

EISSN:2375-4702

DOI:10.1145/3292011

Editor:
Nianwen Xue
Brandeis University, Waltham, USA

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 November 2018

Accepted: 01 June 2018

Revised: 01 March 2018

Received: 01 October 2017

Published in TALLIP Volume 18, Issue 1

Qualifiers

Short-paper
Research
Refereed

Funding Sources

Yunnan Province Natural Science Foundation
National Natural Science Foundation of China
Kunming University of Technology Introduction of Talent Research Start-up Foundation
Yunnan Province Department of Education Foundation
