Word Frequency Approximation for Chinese Without Using Manually-Annotated Corpus

Sun, Maosong; Zhang, Zhengcao; T’sou, Benjamin Ka-Yin; Lu, Huaming

doi:10.1007/11671299_13

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3878)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

Word frequencies play an important role in a variety of NLP applications. Estimating word frequencies for Chinese is challenging because of characteristics of the language, in particular its word formation and the need for word segmentation. This paper addresses word frequency estimation under the condition that only a Chinese wordlist and a raw Chinese corpus of arbitrarily large size are available, and no manual annotation of the corpus is performed. Several realistic schemes for approximating word frequencies are presented under the frameworks of STR (using the frequency of a character string as an approximation of word frequency) and MM (maximal matching). Large-scale experiments indicate that the proposed scheme, MinMaxMM, can significantly benefit the estimation of word frequencies, though its performance is still not very satisfactory in some cases.
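To make the two frameworks named in the abstract concrete, here is a minimal Python sketch that contrasts them on a toy input. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy wordlist, and the toy corpus are all hypothetical, MM is shown only in its common forward variant, and the MinMaxMM scheme is not reproduced because the abstract does not describe it.

```python
from collections import Counter


def str_counts(corpus, wordlist):
    """STR: count every (possibly overlapping) occurrence of each word as a raw string."""
    counts = Counter()
    for word in wordlist:
        start = corpus.find(word)
        while start != -1:
            counts[word] += 1
            start = corpus.find(word, start + 1)
    return counts


def forward_mm(text, wordlist):
    """Forward maximal matching: at each position take the longest wordlist entry,
    falling back to a single character when nothing in the wordlist matches."""
    max_len = max(len(w) for w in wordlist)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in wordlist:
                tokens.append(piece)
                i += size
                break
    return tokens


def mm_counts(corpus, wordlist):
    """Count only those MM tokens that are wordlist entries."""
    return Counter(t for t in forward_mm(corpus, wordlist) if t in wordlist)


# Hypothetical toy data: "研究生命起源" ("study the origin of life").
wordlist = {"研究", "研究生", "生命", "起源"}
corpus = "研究生命起源"

str_freq = str_counts(corpus, wordlist)  # every wordlist string found is counted
mm_freq = mm_counts(corpus, wordlist)    # only tokens of one segmentation are counted
```

On this toy input, STR counts all four wordlist entries once each, including the overlapping strings 研究 and 研究生, while forward MM commits to the segmentation 研究生 / 命 / 起源 and so counts only 研究生 and 起源. Both outcomes differ from the intended reading 研究 / 生命 / 起源, which illustrates why either framework on its own yields only an approximation of word frequency.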

Author information

Authors and Affiliations

The State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
Maosong Sun & Zhengcao Zhang
Language Information Sciences Research Center, City University of Hong Kong
Benjamin Ka-Yin T’sou
School of Business, Beijing Institute of Machinery, Beijing, 100085, China
Huaming Lu

Editor information

Editors and Affiliations

National Polytechnic Institute, Center for Computing Research, 07738, Mexico City, México
Alexander Gelbukh

About this paper

Cite this paper

Sun, M., Zhang, Z., T’sou, B.KY., Lu, H. (2006). Word Frequency Approximation for Chinese Without Using Manually-Annotated Corpus. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2006. Lecture Notes in Computer Science, vol 3878. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11671299_13

DOI: https://doi.org/10.1007/11671299_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-32205-4
Online ISBN: 978-3-540-32206-1
eBook Packages: Computer Science, Computer Science (R0)
