An Example-Based Study on Chinese Word Segmentation Using Critical Fragments

  • Conference paper
Natural Language Processing – IJCNLP 2004 (IJCNLP 2004)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3248)

Abstract

In our study, sentences are represented as sequences of critical fragments, and a critical fragment is considered ambiguous if more than one distinct resolution for it is found in the training corpus. Unlike other studies, ours disambiguates the ambiguous critical fragments with an example-based system. The contexts on either side of an ambiguous critical fragment, i.e. the adjacent characters, words and critical fragments, are used to measure the distance between training and testing examples. Two measures, the overlap metric and chi-squared feature weighting, are employed, and our system achieves a precision of 93.65% and a recall of 96.56% in the open test.
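
The distance computation named in the abstract can be illustrated with a minimal sketch. It assumes the usual memory-based-learning definitions: the overlap metric counts the context features on which two examples disagree, and chi-squared weighting scales each feature by its chi-squared statistic against the resolution labels. The function names, the two-feature context, and the toy data are hypothetical and are not taken from the paper, whose actual feature set and settings are not shown on this page.

```python
from collections import Counter

def chi_squared_weights(examples):
    """Return one chi-squared statistic per feature position.

    examples: list of (features, label) pairs, where features is a tuple of
    symbolic context values and label is the chosen resolution.
    """
    n = len(examples)
    label_counts = Counter(label for _, label in examples)
    weights = []
    for i in range(len(examples[0][0])):
        value_counts = Counter(feats[i] for feats, _ in examples)
        observed = Counter((feats[i], label) for feats, label in examples)
        chi2 = 0.0
        for value, n_value in value_counts.items():
            for label, n_label in label_counts.items():
                expected = n_value * n_label / n
                chi2 += (observed[(value, label)] - expected) ** 2 / expected
        weights.append(chi2)
    return weights

def overlap_distance(a, b, weights):
    """Weighted overlap metric: sum the weights of mismatching features."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)

def classify(train, query, weights):
    """Return the label of the nearest training example (1-nearest neighbour)."""
    nearest = min(train, key=lambda ex: overlap_distance(ex[0], query, weights))
    return nearest[1]

# Hypothetical toy data: (left context, right context) -> resolution of an
# ambiguous fragment "ABC". The left context decides the resolution here;
# the right context carries no information.
train = [
    (("x", "p"), "AB/C"),
    (("x", "q"), "AB/C"),
    (("y", "p"), "A/BC"),
    (("y", "q"), "A/BC"),
]
weights = chi_squared_weights(train)
resolution = classify(train, ("x", "r"), weights)
```

In this toy set the informative left context receives a large weight and the uninformative right context receives a weight of zero, so an unseen right-context value does not affect which stored example is nearest.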

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hu, Q., Pan, H., Kit, C. (2005). An Example-Based Study on Chinese Word Segmentation Using Critical Fragments. In: Su, KY., Tsujii, J., Lee, JH., Kwong, O.Y. (eds) Natural Language Processing – IJCNLP 2004. IJCNLP 2004. Lecture Notes in Computer Science, vol 3248. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30211-7_75

  • DOI: https://doi.org/10.1007/978-3-540-30211-7_75

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-24475-2

  • Online ISBN: 978-3-540-30211-7

  • eBook Packages: Computer Science, Computer Science (R0)
