Abstract
We represent sentences as sequences of critical fragments and treat a critical fragment as ambiguous if more than one distinct resolution of it occurs in the training corpus. Unlike other studies, we disambiguate ambiguous critical fragments with an example-based system. The context on either side of an ambiguous critical fragment, i.e. the adjacent characters, words and critical fragments, is used to measure the distance between training and testing examples. Two measures are employed, the overlap metric and chi-squared feature weighting, and our system achieves a precision of 93.65% and a recall of 96.56% in the open test.
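The distance computation described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes each example is a fixed-length tuple of context features, weights each feature position by its chi-squared statistic against the resolution labels, and classifies with a single nearest neighbour. The function names, the two-feature toy data, and the labels "A" and "B" are all hypothetical; the abstract does not specify the actual feature set or classifier settings.

```python
from collections import Counter


def chi_squared_weights(examples, labels):
    """Chi-squared statistic of each feature position against the labels."""
    n = len(examples)
    label_counts = Counter(labels)
    weights = []
    for i in range(len(examples[0])):
        value_counts = Counter(ex[i] for ex in examples)
        joint = Counter((ex[i], lab) for ex, lab in zip(examples, labels))
        chi2 = 0.0
        for value, n_value in value_counts.items():
            for label, n_label in label_counts.items():
                expected = n_value * n_label / n
                observed = joint[(value, label)]
                chi2 += (observed - expected) ** 2 / expected
        weights.append(chi2)
    return weights


def weighted_overlap(a, b, weights):
    """Sum the weights of the feature positions where a and b differ."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


def classify(query, examples, labels, weights):
    """Return the label of the nearest training example (1-NN)."""
    nearest = min(
        range(len(examples)),
        key=lambda i: weighted_overlap(query, examples[i], weights),
    )
    return labels[nearest]


# Hypothetical toy data: (left context, right context) -> resolution label.
train = [("x", "p"), ("x", "q"), ("y", "p"), ("y", "q")]
train_labels = ["A", "A", "B", "B"]

weights = chi_squared_weights(train, train_labels)
prediction = classify(("x", "r"), train, train_labels, weights)
```

In this toy set the left context fully determines the label and the right context carries no information, so the weighting makes a mismatch on the left context dominate the distance. Passing `[1.0, 1.0]` as `weights` gives the plain overlap metric instead.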
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hu, Q., Pan, H., Kit, C. (2005). An Example-Based Study on Chinese Word Segmentation Using Critical Fragments. In: Su, KY., Tsujii, J., Lee, JH., Kwong, O.Y. (eds) Natural Language Processing – IJCNLP 2004. IJCNLP 2004. Lecture Notes in Computer Science, vol 3248. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30211-7_75
DOI: https://doi.org/10.1007/978-3-540-30211-7_75
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24475-2
Online ISBN: 978-3-540-30211-7
eBook Packages: Computer Science (R0)