Abstract
Previous research showed that a Kalman filter based human-computer interactive Chinese word segmentation algorithm is effective at reducing user interventions. This paper designs an improved statistical model for ancient Chinese texts and integrates it with the Kalman filter based framework. An online interactive system is presented for segmenting ancient Chinese corpora. Experiments show that this approach is advantageous for processing domain-specific text without the support of dictionaries or annotated corpora. Our improved statistical model outperformed the baseline model by 30% in segmentation precision.
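To illustrate the general idea of combining a statistical prior with user feedback through a Kalman filter, the following is a minimal sketch and not the authors' model. It assumes a scalar state per character pair (the estimated strength of a word boundary), an identity state transition, and user corrections treated as noisy observations. The names `BoundaryEstimate` and `kalman_update`, and the noise values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BoundaryEstimate:
    """Scalar Kalman state: belief that a word boundary lies between two characters."""
    mean: float      # current estimate of boundary strength (hypothetical 0..1 scale)
    variance: float  # uncertainty of that estimate


def kalman_update(state: BoundaryEstimate, observation: float,
                  obs_noise: float = 0.05,
                  process_noise: float = 0.0) -> BoundaryEstimate:
    """One predict/update cycle of a scalar Kalman filter with identity dynamics."""
    # Predict: the state is assumed static, so only the variance can grow.
    prior_var = state.variance + process_noise
    # Kalman gain: how much weight the new observation gets relative to the prior.
    gain = prior_var / (prior_var + obs_noise)
    # Update: move the mean toward the observation and shrink the variance.
    new_mean = state.mean + gain * (observation - state.mean)
    new_var = (1.0 - gain) * prior_var
    return BoundaryEstimate(new_mean, new_var)


# Assumed prior from a statistical model: weak, uncertain belief in a boundary.
estimate = BoundaryEstimate(mean=0.3, variance=0.2)

# The user marks a boundary at this position twice (observation = 1.0).
for _ in range(2):
    estimate = kalman_update(estimate, observation=1.0)

print(estimate)
```

In this sketch each user correction pulls the estimate toward the observed value and reduces its variance, so later corrections at the same position change the estimate less than earlier ones.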
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chen, T., Zhu, W., Lv, X., Hu, J. (2013). A Kalman Filter Based Human-Computer Interactive Word Segmentation System for Ancient Chinese Texts. In: Sun, M., Zhang, M., Lin, D., Wang, H. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. NLP-NABD 2013, CCL 2013. Lecture Notes in Computer Science, vol 8202. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41491-6_3
DOI: https://doi.org/10.1007/978-3-642-41491-6_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41490-9
Online ISBN: 978-3-642-41491-6
eBook Packages: Computer Science, Computer Science (R0)