An Example-Based Study on Chinese Word Segmentation Using Critical Fragments

Hu, Qinan; Pan, Haihua; Kit, Chunyu

doi:10.1007/978-3-540-30211-7_75

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3248)

Included in the following conference series:

International Conference on Natural Language Processing

Abstract

In our study, sentences are represented as sequences of critical fragments, and a critical fragment is considered ambiguous if more than one distinct resolution for it is found in the training corpus. Unlike other studies, ours disambiguates the ambiguous critical fragments with an example-based system. The contexts on either side of an ambiguous critical fragment, i.e. the adjacent characters, words and critical fragments, are used to measure the distance between training and testing examples. Two kinds of measures, the overlap metric and chi-squared feature weighting, are employed, and our system achieves a precision of 93.65% and a recall of 96.56% in the open test.
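To make the two distance measures concrete, the following is a minimal sketch of example-based disambiguation with an overlap metric and chi-squared feature weights. It is not the authors' implementation: the feature layout (one left-context and one right-context symbol), the toy examples, the resolution labels, the use of a single nearest neighbour, and all function names are assumptions chosen for illustration, since the abstract does not specify them.

```python
from collections import Counter


def chi_squared_weights(examples):
    """Return one chi-squared statistic per feature position.

    `examples` is a list of (features, label) pairs, where `features` is a
    tuple of context symbols. Each statistic measures how strongly that
    feature position is associated with the labels in the training data.
    """
    n = len(examples)
    n_features = len(examples[0][0])
    label_counts = Counter(label for _, label in examples)
    weights = []
    for i in range(n_features):
        value_counts = Counter(feats[i] for feats, _ in examples)
        observed = Counter((feats[i], label) for feats, label in examples)
        chi2 = 0.0
        for value, v_count in value_counts.items():
            for label, l_count in label_counts.items():
                expected = v_count * l_count / n
                chi2 += (observed[(value, label)] - expected) ** 2 / expected
        weights.append(chi2)
    return weights


def weighted_overlap(a, b, weights):
    """Overlap distance: sum the weights of the positions where a and b differ."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


def classify(query, examples, weights):
    """Return the label of the nearest training example (1-NN, an assumption)."""
    nearest = min(examples, key=lambda ex: weighted_overlap(query, ex[0], weights))
    return nearest[1]


# Hypothetical training examples for one ambiguous fragment "ABC":
# features are (left context, right context); labels are its resolutions.
TRAIN = [
    (("a", "x"), "AB|C"),
    (("a", "y"), "AB|C"),
    (("b", "x"), "A|BC"),
    (("b", "y"), "A|BC"),
]

WEIGHTS = chi_squared_weights(TRAIN)
print(WEIGHTS)                              # [4.0, 0.0]
print(classify(("a", "z"), TRAIN, WEIGHTS))  # AB|C
```

In this toy data the left context fully determines the resolution while the right context is independent of it, so the chi-squared weighting gives the right context a weight of zero. With all weights set to 1, the same `weighted_overlap` function reduces to the plain overlap metric, which simply counts mismatching positions.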

Author information

Authors and Affiliations

Department of Chinese, Translation and Linguistics, City University of Hong Kong, Hong Kong
Qinan Hu, Haihua Pan & Chunyu Kit

Editor information

Editors and Affiliations

Behavior Design Corporation, 2F, No.5, Industry E. Rd IV, Science-Based Industrial Park, Hsinchu, Taiwan
Keh-Yih Su
University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113-0033; JST CREST, Honcho 4-1-8, Kawaguchi-shi, Saitama 332-0012
Jun’ichi Tsujii
Pohang University of Science and Technology (POSTECH), AITrc, Republic of Korea
Jong-Hyeok Lee
Language Information Sciences Research Centre, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
Oi Yee Kwong

About this paper

Cite this paper

Hu, Q., Pan, H., Kit, C. (2005). An Example-Based Study on Chinese Word Segmentation Using Critical Fragments. In: Su, K.-Y., Tsujii, J., Lee, J.-H., Kwong, O.Y. (eds) Natural Language Processing – IJCNLP 2004. IJCNLP 2004. Lecture Notes in Computer Science, vol 3248. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30211-7_75

DOI: https://doi.org/10.1007/978-3-540-30211-7_75
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24475-2
Online ISBN: 978-3-540-30211-7
eBook Packages: Computer Science, Computer Science (R0)
