Self-Supervised Chinese Word Segmentation

Peng, Fuchun; Schuurmans, Dale

doi:10.1007/3-540-44816-0_24

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2189)

Included in the following conference series:

International Symposium on Intelligent Data Analysis

Abstract

We propose a new unsupervised training method for acquiring probability models that accurately segment Chinese character sequences into words. By constructing a core lexicon to guide unsupervised word learning, self-supervised segmentation overcomes the local maxima problems that hamper standard EM training. Our procedure uses successive EM phases to learn a good probability model over character strings, and then prunes this model with a mutual information selection criterion to obtain a more accurate word lexicon. The segmentations produced by these models are more accurate than those produced by training with EM alone.
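
As a rough illustration of the kind of pipeline the abstract describes, the sketch below segments text with a unigram word lexicon, re-estimates the lexicon from its own segmentations, and prunes weakly associated multi-character entries. It is not the authors' procedure: it substitutes Viterbi ("hard") EM for full EM, and the pointwise mutual information score, the function names, `max_len`, `threshold`, and the toy lexicon are all assumptions made for illustration.

```python
import math
from collections import Counter


def viterbi_segment(text, word_prob, max_len=4):
    """Return the most probable split of `text` under a unigram word model."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word in text[:i]
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            p = word_prob.get(text[j:i], 0.0)
            if p > 0.0 and best[j] + math.log(p) > best[i]:
                best[i] = best[j] + math.log(p)
                back[i] = j
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this lexicon")
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


def hard_em_step(corpus, word_prob, max_len=4):
    """One Viterbi-EM update (a simplification of full EM):
    segment every sentence, then re-estimate probabilities by counting."""
    counts = Counter()
    for sentence in corpus:
        counts.update(viterbi_segment(sentence, word_prob, max_len))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def prune_by_mutual_information(word_prob, threshold):
    """Drop multi-character words whose characters are weakly associated.

    Assumed criterion: log p(word) minus the sum of log p(char), i.e. the
    pointwise mutual information against independent characters.
    """
    kept = {}
    for word, p in word_prob.items():
        if p <= 0.0:
            continue
        if len(word) > 1:
            # Unseen characters get a small floor probability (an assumption).
            independent = math.prod(word_prob.get(ch, 1e-6) for ch in word)
            if math.log(p / independent) < threshold:
                continue
        kept[word] = p
    total = sum(kept.values())
    return {w: p / total for w, p in kept.items()}


# Toy lexicon over Latin letters, standing in for Chinese characters.
lexicon = {"a": 0.1, "b": 0.1, "c": 0.1, "ab": 0.4, "bc": 0.3}
print(viterbi_segment("abc", lexicon))
print(hard_em_step(["abc", "ab"], lexicon))
print(prune_by_mutual_information(lexicon, threshold=3.5))
```

Alternating `hard_em_step` with `prune_by_mutual_information` gives a loop in the spirit of the successive EM phases and lexicon pruning mentioned above; how the paper builds its core lexicon and schedules those phases is not specified in this abstract.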

Author information

Authors and Affiliations

Department of Computer Science, University of Waterloo, 200 University Avenue West, Waterloo, Ontario, Canada, N2L 3G1
Fuchun Peng & Dale Schuurmans

Editor information

Editors and Affiliations

Royal Institute of Technology, Centre for Autonomous Systems, 10044, Stockholm, Sweden
Frank Hoffmann
Imperial College, Huxley Building 180 Queen’s Gate, London, SW7 2BZ, UK
David J. Hand & Niall Adams
Department of Computer Science, Vanderbilt University, Box 1679, Station B, Nashville, TN, 37235, USA
Douglas Fisher
Department of Computer Science, New University of Lisbon, 2825-114, Caparica, Portugal
Gabriela Guimaraes

About this paper

Cite this paper

Peng, F., Schuurmans, D. (2001). Self-Supervised Chinese Word Segmentation. In: Hoffmann, F., Hand, D.J., Adams, N., Fisher, D., Guimaraes, G. (eds) Advances in Intelligent Data Analysis. IDA 2001. Lecture Notes in Computer Science, vol 2189. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44816-0_24

DOI: https://doi.org/10.1007/3-540-44816-0_24
Published: 03 September 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42581-6
Online ISBN: 978-3-540-44816-7
eBook Packages: Springer Book Archive
