A Chunking Strategy Towards Unknown Word Detection in Chinese Word Segmentation

  • Conference paper
Natural Language Processing – IJCNLP 2005 (IJCNLP 2005)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3651)

Abstract

This paper proposes a chunking strategy for detecting unknown words in Chinese word segmentation. First, a raw sentence is pre-segmented into a sequence of word atoms using a maximum matching algorithm. A chunking model then detects unknown words by chunking one or more word atoms together according to their word formation patterns. The chunking model is a discriminative Markov model named the Mutual Information Independence Model (MIIM). A maximum entropy model is applied to integrate various types of contexts and to resolve the data sparseness problem in MIIM. In addition, an error-driven learning approach is proposed to learn useful contexts in the maximum entropy model. This significantly reduces the number of contexts in the maximum entropy model without degrading performance, which in turn makes it possible to improve performance further by considering more types of contexts. Evaluation on the PK and CTB corpora from the First SIGHAN Chinese word segmentation bakeoff shows that the chunking approach successfully detects about 80% of unknown words on both corpora and outperforms the best-reported systems in unknown word detection by 8.1% and 7.1%, respectively.
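
The two-stage pipeline described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not say which variant of maximum matching is used, so forward (left-to-right) matching is assumed here, and the toy lexicon, the `max_len` limit, and the B/I chunk tags are hypothetical choices made only for illustration. In the paper the chunk tags would come from the MIIM chunking model; here they are supplied by hand.

```python
from typing import List, Set


def forward_maximum_matching(sentence: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Greedy left-to-right pre-segmentation into word atoms.

    At each position the longest lexicon entry (up to max_len characters)
    is taken; a character not covered by the lexicon becomes its own atom.
    """
    atoms: List[str] = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first and shrink down to one character.
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in lexicon:
                atoms.append(candidate)
                i += length
                break
    return atoms


def merge_chunks(atoms: List[str], tags: List[str]) -> List[str]:
    """Join word atoms into words using assumed B (begin) / I (inside) tags."""
    words: List[str] = []
    for atom, tag in zip(atoms, tags):
        if tag == "I" and words:
            words[-1] += atom  # extend the current chunk
        else:
            words.append(atom)  # start a new chunk
    return words


# Toy lexicon (hypothetical): the personal name is not in it, so it is unknown.
lexicon = {"在", "北京", "工作"}
atoms = forward_maximum_matching("王小明在北京工作", lexicon)
print(atoms)  # ['王', '小', '明', '在', '北京', '工作']

# Hand-written tags standing in for the chunking model's output.
tags = ["B", "I", "I", "B", "B", "B"]
print(merge_chunks(atoms, tags))  # ['王小明', '在', '北京', '工作']
```

In this toy run the unknown name falls apart into three single-character atoms during pre-segmentation, and the chunking step is what reassembles them into one word.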

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

GuoDong, Z. (2005). A Chunking Strategy Towards Unknown Word Detection in Chinese Word Segmentation. In: Dale, R., Wong, KF., Su, J., Kwong, O.Y. (eds) Natural Language Processing – IJCNLP 2005. IJCNLP 2005. Lecture Notes in Computer Science, vol 3651. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562214_47

  • DOI: https://doi.org/10.1007/11562214_47

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-29172-5

  • Online ISBN: 978-3-540-31724-1

  • eBook Packages: Computer Science, Computer Science (R0)
