Abstract
This paper proposes a chunking strategy to detect unknown words in Chinese word segmentation. First, a raw sentence is pre-segmented into a sequence of word atoms using a maximum matching algorithm. Then a chunking model is applied to detect unknown words by chunking one or more word atoms together according to the word formation patterns of the word atoms. A discriminative Markov model, the Mutual Information Independence Model (MIIM), is adopted for chunking. In addition, a maximum entropy model is applied to integrate various types of contexts and to resolve the data sparseness problem in MIIM. Moreover, an error-driven learning approach is proposed to learn useful contexts in the maximum entropy model. In this way, the number of contexts in the maximum entropy model can be significantly reduced without loss of performance, which makes it possible to improve performance further by considering more types of contexts. Evaluation on the PK and CTB corpora from the First SIGHAN Chinese word segmentation bakeoff shows that our chunking approach successfully detects about 80% of unknown words on both corpora and outperforms the best-reported systems by 8.1% and 7.1%, respectively, in unknown word detection.
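The two-stage pipeline described in the abstract can be illustrated with a minimal sketch. It assumes forward (left-to-right) maximum matching, which the abstract does not specify, and a hypothetical two-tag scheme ("B" starts a word, "I" continues it) standing in for the output of the chunking model; the function names, the `max_len` parameter, and the tag set are illustrative assumptions, not the paper's implementation. The chunking model itself (MIIM with maximum entropy) is not reproduced here.

```python
def forward_maximum_matching(sentence, lexicon, max_len=4):
    """Pre-segment a sentence into word atoms (assumed forward variant).

    At each position the longest lexicon entry is taken; a single
    character is always accepted as a fallback atom.
    """
    atoms = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first and shrink until one matches.
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in lexicon:
                atoms.append(candidate)
                i += length
                break
    return atoms


def merge_atoms(atoms, tags):
    """Chunk word atoms into words using hypothetical B/I tags.

    "B" starts a new word; "I" appends the atom to the previous word.
    In the paper the chunk decisions would come from the chunking model.
    """
    words = []
    for atom, tag in zip(atoms, tags):
        if tag == "I" and words:
            words[-1] += atom
        else:
            words.append(atom)
    return words


# Toy lexicon with ASCII strings standing in for Chinese characters.
lexicon = {"ab", "abc", "d"}
atoms = forward_maximum_matching("abcde", lexicon)
words = merge_atoms(atoms, ["B", "B", "I"])
```

In this toy run the out-of-lexicon string "de" is first split into the atoms "d" and "e" and then recovered as one word by the chunk tags, which is the role the abstract assigns to the chunking step.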
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
GuoDong, Z. (2005). A Chunking Strategy Towards Unknown Word Detection in Chinese Word Segmentation. In: Dale, R., Wong, KF., Su, J., Kwong, O.Y. (eds) Natural Language Processing – IJCNLP 2005. IJCNLP 2005. Lecture Notes in Computer Science, vol 3651. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562214_47
DOI: https://doi.org/10.1007/11562214_47
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29172-5
Online ISBN: 978-3-540-31724-1
eBook Packages: Computer Science, Computer Science (R0)