Abstract
This paper presents a human-computer interaction learning model for segmenting Chinese texts that depends on neither a lexicon nor any annotated corpus. It enables users to add language knowledge to the system by intervening directly in the segmentation process. Within a limited number of user interventions, the system returns a segmentation result that fully matches the intended use (that is, with an accuracy of 100% by manual judgement). A Kalman filter based model is adopted to learn and estimate the users' intention quickly and precisely from their interventions, so as to reduce subsequent system prediction errors. Experiments show that the model achieves encouraging performance in saving human effort, and that the segmenter with knowledge learned from users outperforms the baseline model by about 10% in segmenting homogeneous texts.
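As a rough illustration of the filtering idea mentioned in the abstract, the following is a minimal sketch of a generic scalar Kalman filter in Python. It is not the authors' model: the random-walk state, the noise variances `q` and `r`, the class name `ScalarKalman`, and the encoding of a user correction as the observation `1.0` are all assumptions made here for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ScalarKalman:
    """One-dimensional Kalman filter with a random-walk state model."""
    x: float = 0.0   # state estimate (hypothetical "split here" preference score)
    p: float = 1.0   # variance of the estimate
    q: float = 0.01  # process noise variance (assumed value)
    r: float = 0.25  # measurement noise variance (assumed value)

    def update(self, z: float) -> float:
        # Predict: the state is assumed to persist, so only the uncertainty grows.
        p_pred = self.p + self.q
        # Correct: blend the prediction with the new observation z.
        k = p_pred / (p_pred + self.r)   # Kalman gain
        self.x = self.x + k * (z - self.x)
        self.p = (1.0 - k) * p_pred
        return self.x


kf = ScalarKalman()
# Each 1.0 stands for a hypothetical user intervention marking a word boundary.
estimates = [kf.update(z) for z in [1.0, 1.0, 1.0, 1.0]]
```

With repeated consistent observations, the estimate moves toward the observed value while the estimate variance shrinks, which is the general behaviour that makes this kind of filter attractive for learning from a small number of interventions.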
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhu, W., Sun, N., Zou, X., Hu, J. (2013). The Application of Kalman Filter Based Human-Computer Learning Model to Chinese Word Segmentation. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2013. Lecture Notes in Computer Science, vol 7816. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37247-6_18
DOI: https://doi.org/10.1007/978-3-642-37247-6_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37246-9
Online ISBN: 978-3-642-37247-6
eBook Packages: Computer Science (R0)