Self-Supervised Chinese Word Segmentation

  • Conference paper
Advances in Intelligent Data Analysis (IDA 2001)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2189)

Abstract

We propose a new unsupervised training method for acquiring probability models that accurately segment Chinese character sequences into words. By constructing a core lexicon to guide unsupervised word learning, self-supervised segmentation overcomes the local maxima problems that hamper standard EM training. Our procedure uses successive EM phases to learn a good probability model over character strings, and then prunes this model with a mutual information selection criterion to obtain a more accurate word lexicon. The segmentations produced by these models are more accurate than those produced by training with EM alone.
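The two ingredients named in the abstract, EM training of a probability model over character strings and pruning with a mutual information criterion, can be illustrated with a minimal Python sketch. It is not the authors' procedure: the abstract does not specify the model, the core lexicon construction, or the exact pruning rule. The sketch assumes a unigram word model over all substrings up to `max_len` characters, and a pointwise mutual information score against the best two-part split of each word. It leaves out the core lexicon and the successive EM phases entirely. All function names and the `threshold` parameter are hypothetical.

```python
import math
from collections import defaultdict


def init_lexicon(corpus, max_len):
    """Assumed starting point: every substring up to max_len, uniform probability."""
    words = {s[i:i + k] for s in corpus
             for k in range(1, max_len + 1)
             for i in range(len(s) - k + 1)}
    return {w: 1.0 / len(words) for w in words}


def em_step(corpus, probs, max_len):
    """One EM iteration: expected word counts via forward-backward, then renormalize."""
    counts = defaultdict(float)
    for s in corpus:
        n = len(s)
        fwd = [0.0] * (n + 1)  # fwd[i]: total probability of segmentations of s[:i]
        fwd[0] = 1.0
        for i in range(1, n + 1):
            for k in range(1, min(max_len, i) + 1):
                fwd[i] += fwd[i - k] * probs.get(s[i - k:i], 0.0)
        bwd = [0.0] * (n + 1)  # bwd[i]: total probability of segmentations of s[i:]
        bwd[n] = 1.0
        for i in range(n - 1, -1, -1):
            for k in range(1, min(max_len, n - i) + 1):
                bwd[i] += probs.get(s[i:i + k], 0.0) * bwd[i + k]
        if fwd[n] == 0.0:
            continue  # sentence cannot be generated by the current lexicon
        for i in range(n):
            for k in range(1, min(max_len, n - i) + 1):
                w = s[i:i + k]
                counts[w] += fwd[i] * probs.get(w, 0.0) * bwd[i + k] / fwd[n]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items() if c > 0.0}


def prune_by_mi(probs, threshold):
    """Assumed criterion: drop a multi-character word when log p(w) - log p(a)p(b)
    falls below threshold for its most probable two-part split w = a + b."""
    kept = {}
    for w, p in probs.items():
        if len(w) > 1:
            split = max(probs.get(w[:j], 0.0) * probs.get(w[j:], 0.0)
                        for j in range(1, len(w)))
            if split > 0.0 and math.log(p / split) < threshold:
                continue
        kept[w] = p
    total = sum(kept.values())
    return {w: q / total for w, q in kept.items()}


def segment(s, probs, max_len):
    """Most probable (Viterbi) segmentation; falls back to single characters."""
    n = len(s)
    best = [-math.inf] * (n + 1)
    best[0] = 0.0
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for k in range(1, min(max_len, i) + 1):
            p = probs.get(s[i - k:i], 0.0)
            if p > 0.0 and best[i - k] + math.log(p) > best[i]:
                best[i] = best[i - k] + math.log(p)
                back[i] = k
    if best[n] == -math.inf:
        return list(s)
    words, i = [], n
    while i > 0:
        words.append(s[i - back[i]:i])
        i -= back[i]
    return words[::-1]


if __name__ == "__main__":
    # Latin letters stand in for Chinese characters in this toy corpus.
    corpus = ["abcabd", "abdabc", "cabab"]
    probs = init_lexicon(corpus, 2)
    for _ in range(5):
        probs = em_step(corpus, probs, 2)
    lexicon = prune_by_mi(probs, 0.0)
    print(segment("abcab", lexicon, 2))
```

Started from a uniform lexicon, plain EM on a model like this tends to favor long strings, which is one form of the local maxima problem the abstract attributes to standard EM training.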

Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Peng, F., Schuurmans, D. (2001). Self-Supervised Chinese Word Segmentation. In: Hoffmann, F., Hand, D.J., Adams, N., Fisher, D., Guimaraes, G. (eds) Advances in Intelligent Data Analysis. IDA 2001. Lecture Notes in Computer Science, vol 2189. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44816-0_24

  • DOI: https://doi.org/10.1007/3-540-44816-0_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-42581-6

  • Online ISBN: 978-3-540-44816-7

  • eBook Packages: Springer Book Archive
