Word Frequency Approximation for Chinese Without Using Manually-Annotated Corpus

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2006)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3878)

Abstract

Word frequencies play an important role in a variety of NLP applications. Estimating word frequencies for Chinese is challenging because of the characteristics of the language, in particular its word formation and the need for word segmentation. This paper addresses word frequency estimation when only a Chinese wordlist and a raw Chinese corpus of arbitrarily large size are available, and no manual annotation of the corpus is performed. Several realistic schemes for approximating word frequencies are presented under the frameworks of STR (using the frequency of a character string as an approximation of word frequency) and MM (maximal matching). Large-scale experiments indicate that the proposed scheme, MinMaxMM, significantly benefits the estimation of word frequencies, though its performance is still not very satisfactory in some cases.
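To make the two frameworks named in the abstract concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes STR means counting every occurrence of a wordlist entry as a character string, and it uses plain forward maximal matching as one possible instance of MM. The function names, the toy wordlist, and the toy corpus are illustrative assumptions; the MinMaxMM scheme itself is not described in the abstract and is not reproduced here.

```python
from collections import Counter


def str_counts(corpus, wordlist):
    """STR-style estimate (assumed reading): count every occurrence,
    including overlapping ones, of each wordlist entry as a raw string."""
    counts = Counter()
    for word in wordlist:
        start = 0
        while True:
            idx = corpus.find(word, start)
            if idx == -1:
                break
            counts[word] += 1
            start = idx + 1  # step by one character so overlaps are counted
    return counts


def forward_mm_counts(corpus, wordlist):
    """MM-style estimate (assumed variant): segment with greedy forward
    maximal matching, then count the resulting tokens.
    Assumes a non-empty wordlist."""
    words = set(wordlist)
    max_len = max(len(w) for w in words)
    counts = Counter()
    i = 0
    while i < len(corpus):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(max_len, len(corpus) - i), 0, -1):
            candidate = corpus[i:i + length]
            if length == 1 or candidate in words:
                counts[candidate] += 1
                i += length
                break
    return counts


# Hypothetical toy data, chosen only to show how the two estimates differ.
toy_wordlist = ["研究", "研究生", "生命", "起源"]
toy_corpus = "研究生命起源"

print(str_counts(toy_corpus, toy_wordlist))
print(forward_mm_counts(toy_corpus, toy_wordlist))
```

On this toy input the string-based count credits all four wordlist entries once each, although they cannot all be words in the same reading of the sentence, while forward maximal matching commits to the segmentation 研究生 / 命 / 起源 and so gives no count to 研究 or 生命. Both behaviours illustrate why neither estimate can be taken as the true word frequency without further correction.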

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sun, M., Zhang, Z., T’sou, B.K.Y., Lu, H. (2006). Word Frequency Approximation for Chinese Without Using Manually-Annotated Corpus. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2006. Lecture Notes in Computer Science, vol 3878. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11671299_13

  • DOI: https://doi.org/10.1007/11671299_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-32205-4

  • Online ISBN: 978-3-540-32206-1

  • eBook Packages: Computer Science, Computer Science (R0)
