A Chunking Strategy Towards Unknown Word Detection in Chinese Word Segmentation

GuoDong, Zhou

doi:10.1007/11562214_47


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3651)

Included in the following conference series:

International Conference on Natural Language Processing


Abstract

This paper proposes a chunking strategy for detecting unknown words in Chinese word segmentation. First, a raw sentence is pre-segmented into a sequence of word atoms using a maximum matching algorithm. A chunking model then detects unknown words by chunking one or more word atoms together according to their word formation patterns. A discriminative Markov model, the Mutual Information Independence Model (MIIM), is adopted for chunking. In addition, a maximum entropy model is applied to integrate various types of contexts and to resolve the data sparseness problem in MIIM. Moreover, an error-driven learning approach is proposed to learn useful contexts in the maximum entropy model. In this way, the number of contexts in the maximum entropy model can be reduced significantly without a decrease in performance, which makes it possible to improve performance further by considering a wider variety of contexts. Evaluation on the PK and CTB corpora from the First SIGHAN Chinese word segmentation bakeoff shows that our chunking approach detects about 80% of unknown words on both corpora and outperforms the best-reported systems by 8.1% and 7.1%, respectively, in unknown word detection.
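
The two-stage pipeline described in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the paper: the matching direction (forward), the `max_len` limit, the toy lexicon, the example sentence, and the simplified B/I chunk tags are all hypothetical. The tags are assigned by hand here; in the paper they would come from the MIIM chunking model, which this sketch does not implement.

```python
def max_match(sentence, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest
    lexicon entry, falling back to a single character (a word atom)."""
    atoms = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, then progressively shorter ones.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            if size == 1 or candidate in lexicon:
                atoms.append(candidate)
                i += size
                break
    return atoms


def merge_chunks(atoms, tags):
    """Join word atoms into words: 'B' starts a new word,
    'I' attaches the atom to the previous word."""
    words = []
    for atom, tag in zip(atoms, tags):
        if tag == "I" and words:
            words[-1] += atom
        else:
            words.append(atom)
    return words


# Hypothetical toy lexicon; the personal name is deliberately absent,
# so it plays the role of an unknown word.
lexicon = {"北京", "大学", "学生"}
atoms = max_match("王小明是北京大学学生", lexicon)

# Hand-assigned tags standing in for the chunking model's output (assumption).
tags = ["B", "I", "I", "B", "B", "B", "B"]
words = merge_chunks(atoms, tags)
print(atoms)
print(words)
```

The unknown name is first broken into single-character atoms by the pre-segmentation step, and the chunk tags then reassemble it into one word, which is the sense in which detection is cast as chunking.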



Author information

Authors and Affiliations

Institute for Infocomm Research, 21 Heng Mui Keng Terrace, 119613, Singapore
Zhou GuoDong


Editor information

Editors and Affiliations

Center for Language Technology, Macquarie University, 2019, Sydney, NSW, Australia
Robert Dale
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong
Kam-Fai Wong
Institute for Infocomm Research, 21, Heng Mui Keng Terrace, 119613, Singapore
Jian Su
Language Information Sciences Research Centre, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
Oi Yee Kwong

About this paper

Cite this paper

GuoDong, Z. (2005). A Chunking Strategy Towards Unknown Word Detection in Chinese Word Segmentation. In: Dale, R., Wong, KF., Su, J., Kwong, O.Y. (eds) Natural Language Processing – IJCNLP 2005. IJCNLP 2005. Lecture Notes in Computer Science (LNAI), vol 3651. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562214_47

DOI: https://doi.org/10.1007/11562214_47
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29172-5
Online ISBN: 978-3-540-31724-1
eBook Packages: Computer Science, Computer Science (R0)
