Abstract
Word frequencies play an important role in many NLP applications. Estimating word frequencies for Chinese remains a major challenge because of the characteristics of the language. An underlying fact is that no perfectly word-segmented Chinese corpus exists. What is currently available is: raw corpora, which can be arbitrarily large; automatically word-segmented corpora derived from raw corpora; and a number of manually word-segmented corpora of relatively small size, developed by different researchers under various word segmentation standards. In this paper we propose a new scheme for word frequency approximation that combines these resources. Experiments indicate that in most cases this scheme benefits word frequency estimation, though in other cases its performance is still not very satisfactory.
This research is supported by the National Natural Science Foundation of China under grant numbers 60573187 and 60321002, and by the Tsinghua-ALVIS Project, co-sponsored by the National Natural Science Foundation of China under grant number 60520130299 and EU FP6.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Qiao, W., Sun, M. (2006). Word Frequency Approximation for Chinese Using Raw, MM-Segmented and Manually Segmented Corpora. In: Matsumoto, Y., Sproat, R.W., Wong, KF., Zhang, M. (eds) Computer Processing of Oriental Languages. Beyond the Orient: The Research Challenges Ahead. ICCPOL 2006. Lecture Notes in Computer Science, vol 4285. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11940098_27
DOI: https://doi.org/10.1007/11940098_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49667-0
Online ISBN: 978-3-540-49668-7
eBook Packages: Computer Science (R0)