Abstract
We propose a new unsupervised training method for acquiring probability models that accurately segment Chinese character sequences into words. By constructing a core lexicon to guide unsupervised word learning, self-supervised segmentation overcomes the local maxima problems that hamper standard EM training. Our procedure uses successive EM phases to learn a good probability model over character strings, and then prunes this model with a mutual information selection criterion to obtain a more accurate word lexicon. The segmentations produced by these models are more accurate than those produced by training with EM alone.
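The two ingredients named in the abstract, segmenting with a probability model over character strings and pruning the lexicon with a mutual information criterion, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a unigram word model, a Viterbi-style search for the best segmentation, and a pointwise mutual information score that compares a word's probability with the product of its characters' probabilities. The function names, the `max_len` limit, the threshold, and the toy probabilities are all hypothetical, and the EM re-estimation phases are omitted.

```python
import math

def segment(text, word_probs, max_len=4):
    """Most probable segmentation of `text` under a unigram word model.

    Assumption: words are independent, so a segmentation's log-probability
    is the sum of the log-probabilities of its words.
    """
    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i]: best log-prob of text[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            p = word_probs.get(text[start:end], 0.0)
            if p <= 0.0:
                continue
            score = best[start] + math.log(p)
            if score > best[end]:
                best[end] = score
                back[end] = start
    if n > 0 and best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this lexicon")
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]

def pmi(word, word_probs, char_probs):
    """Pointwise mutual information: log p(word) - sum of log p(char)."""
    independent = sum(math.log(char_probs[c]) for c in word)
    return math.log(word_probs[word]) - independent

def prune(word_probs, char_probs, threshold=0.0):
    """Drop multi-character words whose PMI is below `threshold` (a
    hypothetical cutoff), then renormalise the remaining probabilities."""
    kept = {
        w: p for w, p in word_probs.items()
        if len(w) == 1 or pmi(w, word_probs, char_probs) >= threshold
    }
    total = sum(kept.values())
    return {w: p / total for w, p in kept.items()}

# Toy lexicon with made-up probabilities, for illustration only.
lexicon = {"中": 0.1, "国": 0.1, "人": 0.1, "民": 0.1,
           "中国": 0.3, "人民": 0.299, "国人": 0.001}
chars = {"中": 0.25, "国": 0.25, "人": 0.25, "民": 0.25}

pruned = prune(lexicon, chars)
print(segment("中国人民", pruned))
```

In this toy setting the low-probability entry "国人" scores below the threshold and is removed, while "中国" and "人民" survive because they are far more probable than their characters would be if they occurred independently.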
© 2001 Springer-Verlag Berlin Heidelberg
Cite this paper
Peng, F., Schuurmans, D. (2001). Self-Supervised Chinese Word Segmentation. In: Hoffmann, F., Hand, D.J., Adams, N., Fisher, D., Guimaraes, G. (eds) Advances in Intelligent Data Analysis. IDA 2001. Lecture Notes in Computer Science, vol 2189. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44816-0_24
DOI: https://doi.org/10.1007/3-540-44816-0_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42581-6
Online ISBN: 978-3-540-44816-7
eBook Packages: Springer Book Archive