Abstract
Since a Chinese syllable can correspond to many characters (homophones), the syllable-to-character conversion task is quite challenging for Chinese phonetic input methods (CPIM). A CPIM usually works in two stages: (1) segment the syllable sequence into syllable words, and (2) select the most likely character word for each syllable word. A CPIM usually assumes that the input is a complete sentence, and its performance is evaluated on a well-formed corpus. In practice, however, most Pinyin users prefer progressive text entry in several short chunks, mostly of one or two words each (most Chinese words consist of two or more characters). Short chunks do not provide enough context to perform the best possible syllable-to-character conversion, especially when a chunk consists of overlapping syllable words. In such cases, a conversion system often selects the boundary of the word with the highest frequency. Short-chunk input is even more popular on platforms with limited computing power, such as mobile phones. Based on the observation that the relative strength of a word can be quite different when calculated leftwards or rightwards, we propose a simple division of the word context into the left context and the right context. Furthermore, we design a double ranking strategy for each word to reduce the number of errors in stage 1. Our strategy is modeled as the minimum feedback arc set problem on bipartite tournaments, with approximate solutions derived from a genetic algorithm. Experiments show that, compared to the frequency-based method (FBM), which is fast and needs little memory, and the conditional random fields (CRF) model, which is slower and needs more memory, our double ranking strategy requires little memory and power while achieving competitive performance. We believe a similar strategy could also be adopted to disambiguate conflicting linguistic patterns effectively.
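The abstract does not spell out how the ranking problem is constructed, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes that each observed conflict is an arc `(winner, loser)` between two disjoint word sets, so that the arcs form a bipartite tournament, and that a ranking is sought which contradicts as few arcs as possible. The minimum number of such backward arcs over all orderings is the size of a minimum feedback arc set. The word labels, function names, and genetic-algorithm parameters are all hypothetical.

```python
import random

def backward_arcs(order, arcs):
    """Count arcs (winner, loser) that the ranking `order` contradicts."""
    pos = {w: i for i, w in enumerate(order)}
    return sum(1 for winner, loser in arcs if pos[winner] > pos[loser])

def order_crossover(p1, p2, rng):
    """Keep a random slice of p1 in place and fill the rest in p2's order."""
    i, j = sorted(rng.sample(range(len(p1) + 1), 2))
    kept = p1[i:j]
    kept_set = set(kept)
    rest = [w for w in p2 if w not in kept_set]
    return rest[:i] + kept + rest[i:]  # still a permutation of the words

def swap_mutation(order, rng):
    """Exchange two randomly chosen positions."""
    a, b = rng.sample(range(len(order)), 2)
    child = list(order)
    child[a], child[b] = child[b], child[a]
    return child

def ga_rank(words, arcs, pop_size=30, generations=200, seed=0):
    """Approximate a ranking with few backward arcs (assumed GA settings)."""
    rng = random.Random(seed)
    population = [rng.sample(words, len(words)) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda o: backward_arcs(o, arcs))
        elite = population[: pop_size // 2]  # the better half survives
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            child = order_crossover(p1, p2, rng)
            if rng.random() < 0.3:  # assumed mutation rate
                child = swap_mutation(child, rng)
            children.append(child)
        population = elite + children
    return min(population, key=lambda o: backward_arcs(o, arcs))

# Toy bipartite tournament: sets {a1, a2} and {b1, b2}, one arc per cross pair.
words = ["a1", "a2", "b1", "b2"]
arcs = [("a1", "b1"), ("b1", "a2"), ("a2", "b2"), ("b2", "a1")]
best = ga_rank(words, arcs)
print(best, backward_arcs(best, arcs))
```

In this toy instance the four arcs form a cycle, so every ranking contradicts at least one of them. A genetic algorithm gives no optimality guarantee, which fits the abstract's description of the solutions as approximate.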
Index Terms
- The Left and Right Context of a Word: Overlapping Chinese Syllable Word Segmentation with Minimal Context