The Left and Right Context of a Word: Overlapping Chinese Syllable Word Segmentation with Minimal Context

Authors:
Mike Tian-Jian Jiang, National Tsing Hua University and Academia Sinica
Tsung-Hsien Lee, Academia Sinica and University of Texas at Austin
Wen-Lian Hsu, Academia Sinica and National Tsing Hua University
ACM Transactions on Asian Language Information Processing, Volume 12, Issue 1, Article No. 2, pp. 1–23. https://doi.org/10.1145/2425327.2425329

Published: 01 March 2013

Abstract

Because a Chinese syllable can correspond to many characters (homophones), syllable-to-character conversion is a challenging task for Chinese phonetic input methods (CPIM). A CPIM usually has two stages: (1) segment the syllable sequence into syllable words, and (2) select the most likely character word for each syllable word. A CPIM usually assumes that the input is a complete sentence and evaluates performance on a well-formed corpus. In practice, however, most Pinyin users prefer progressive text entry in several short chunks, mostly of one or two words each (most Chinese words consist of two or more characters). Short chunks do not provide enough context to perform the best possible syllable-to-character conversion, especially when a chunk consists of overlapping syllable words. In such cases, a conversion system often selects the boundary of the word with the highest frequency. Short-chunk input is even more popular on platforms with limited computing power, such as mobile phones. Based on the observation that the relative strength of a word can be quite different when calculated leftwards or rightwards, we propose a simple division of the word context into a left context and a right context. Furthermore, we design a double ranking strategy for each word to reduce the number of errors in stage 1. Our strategy is modeled as the minimum feedback arc set problem on bipartite tournaments, with approximate solutions derived from a genetic algorithm. Experiments show that, compared with the frequency-based method (FBM), which is fast and uses little memory, and the conditional random fields (CRF) model, which is slower and uses more memory, our double ranking strategy offers low memory and power requirements with competitive performance. We believe a similar strategy could also be adopted to disambiguate conflicting linguistic patterns effectively.
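
To make the notion of overlapping syllable words concrete, the following is a minimal sketch of stage 1 under a frequency-based baseline. It is not the authors' implementation: the toy lexicon, its frequency counts, and the function names `segmentations`, `overlapping_pairs`, and `frequency_baseline` are all hypothetical, and the baseline is only one plausible reading of "selects the boundary of the word with the highest frequency".

```python
from typing import Dict, List, Tuple

Word = Tuple[str, ...]

# Hypothetical toy lexicon: syllable word -> made-up frequency count.
LEXICON: Dict[Word, int] = {
    ("yan", "jiu"): 500,           # e.g. 研究
    ("yan", "jiu", "sheng"): 800,  # e.g. 研究生
    ("sheng", "ming"): 600,        # e.g. 生命
    ("yan",): 40,
    ("jiu",): 40,
    ("sheng",): 40,
    ("ming",): 50,
}


def segmentations(syllables: List[str]) -> List[List[Word]]:
    """Enumerate every split of the syllable chunk into lexicon words."""
    if not syllables:
        return [[]]
    results: List[List[Word]] = []
    for end in range(1, len(syllables) + 1):
        word = tuple(syllables[:end])
        if word in LEXICON:
            for rest in segmentations(syllables[end:]):
                results.append([word] + rest)
    return results


def overlapping_pairs(syllables: List[str]) -> List[Tuple[Tuple[int, int], Tuple[int, int]]]:
    """Return pairs of multi-syllable word spans that partially overlap."""
    n = len(syllables)
    spans = [
        (i, j)
        for i in range(n)
        for j in range(i + 2, n + 1)
        if tuple(syllables[i:j]) in LEXICON
    ]
    # Two spans conflict when the second starts inside the first and ends after it.
    return [(a, b) for a in spans for b in spans if a[0] < b[0] < a[1] < b[1]]


def frequency_baseline(syllables: List[str]) -> List[Word]:
    """Assumed baseline: keep the segmentation holding the most frequent word."""
    return max(
        segmentations(syllables),
        key=lambda seg: max(LEXICON[w] for w in seg),
    )


chunk = ["yan", "jiu", "sheng", "ming"]
print(overlapping_pairs(chunk))   # conflicting spans in the chunk
print(frequency_baseline(chunk))  # segmentation chosen by raw frequency
```

In this toy chunk, the spans for "yan jiu sheng" and "sheng ming" share the syllable "sheng", so they cannot both be words. The baseline resolves the conflict using raw frequency alone, which is the behaviour the left and right context ranking is meant to improve on.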

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing

Published in

ACM Transactions on Asian Language Information Processing Volume 12, Issue 1
March 2013
102 pages
ISSN:1530-0226
EISSN:1558-3430
DOI:10.1145/2425327

Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 March 2013
- Accepted: 1 February 2012
- Revised: 1 January 2012
- Received: 1 August 2011
Author Tags
Chinese phonetic input methods
syllable-to-word conversion
Qualifiers
- research-article
- Research
- Refereed