Abstract
Traditional Chinese document classifiers rely on keywords extracted from the documents, which requires dictionary support and efficient word segmentation procedures. This paper explores techniques that use N-gram information to categorize Chinese documents, so that the classifier is freed from the burden of large dictionaries and complex segmentation processing and consequently becomes domain and time independent. A Chinese document classification system following these techniques is implemented with Naive Bayes, kNN, and hierarchical classification methods. Experimental results show that the system achieves satisfactory performance, comparable with that of traditional classifiers.
This work was supported by China Post-doctoral Science Foundation and the Natural Science Foundation of China (NSFC) under grant number 60173027.
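To make the idea in the abstract concrete, the following is a minimal sketch of segmentation-free classification: overlapping character N-grams serve as features for a multinomial Naive Bayes classifier. It is not the authors' implementation. The names `char_ngrams` and `NgramNaiveBayes`, the choice of bigrams, the Laplace smoothing constant `alpha`, and the toy training sentences are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return overlapping character n-grams, ignoring whitespace.

    No dictionary or word segmenter is involved: the n-grams are
    read directly off the character sequence.
    """
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams (illustrative only)."""

    def __init__(self, n=2, alpha=1.0):
        self.n = n                      # n-gram length (assumed: bigrams)
        self.alpha = alpha              # Laplace smoothing constant (assumed)
        self.class_docs = Counter()     # number of training documents per class
        self.gram_counts = defaultdict(Counter)  # n-gram counts per class
        self.vocab = set()              # all n-grams seen in training

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            grams = char_ngrams(text, self.n)
            self.class_docs[label] += 1
            self.gram_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.gram_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            for g in grams:
                score += math.log((counts[g] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data, invented for this sketch.
train_docs = ["股票市场今日上涨", "银行利率下调", "足球比赛今晚开始", "篮球联赛冠军"]
train_labels = ["finance", "finance", "sports", "sports"]
model = NgramNaiveBayes(n=2).fit(train_docs, train_labels)
```

Calling `model.predict("股票利率")` scores the bigrams of the query against each class. Because the features come straight from the character stream, the same code applies unchanged to text from a new domain or period, which is the independence property the abstract refers to.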
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhou, S., Guan, J. (2002). Chinese Documents Classification Based on N-Grams. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2002. Lecture Notes in Computer Science, vol 2276. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45715-1_43
DOI: https://doi.org/10.1007/3-540-45715-1_43
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43219-7
Online ISBN: 978-3-540-45715-2
eBook Packages: Springer Book Archive