Chinese Documents Classification Based on N-Grams

Zhou, Shuigeng; Guan, Jihong

doi:10.1007/3-540-45715-1_43

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2276)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

Traditional Chinese document classifiers rely on keywords in the documents, which requires dictionary support and efficient segmentation procedures. This paper explores techniques that use N-gram information to categorize Chinese documents, so that the classifier is freed from the burden of large dictionaries and complex segmentation processing and is consequently domain and time independent. A Chinese document classification system following the techniques described above is implemented with Naive Bayes, kNN and hierarchical classification methods. Experimental results show that the system achieves satisfactory performance, comparable with that of traditional classifiers.
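
The general idea in the abstract, classifying Chinese text from character N-grams without a dictionary or word segmenter, can be illustrated with a minimal sketch. This is not the authors' system: the function and class names, the choice of bigrams, the add-one smoothing, and the toy training sentences are all assumptions made for illustration, and the kNN and hierarchical methods mentioned in the abstract are not shown.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return all overlapping character n-grams of text, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-gram counts.

    Hypothetical illustration only; uses add-one (Laplace) smoothing.
    """

    def __init__(self, n=2):
        self.n = n
        self.class_docs = Counter()               # documents seen per class
        self.gram_counts = defaultdict(Counter)   # class -> n-gram counts
        self.vocab = set()                        # all n-grams seen in training

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            grams = char_ngrams(text, self.n)
            self.class_docs[label] += 1
            self.gram_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.gram_counts[label]
            total = sum(counts.values())
            # log prior plus smoothed log likelihood of each n-gram
            score = math.log(n_docs / total_docs)
            for gram in char_ngrams(text, self.n):
                score += math.log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (invented for this sketch, not from the paper).
train_docs = ["股票市场今日上涨", "银行利率下调", "足球比赛昨晚结束", "篮球队赢得冠军"]
train_labels = ["finance", "finance", "sports", "sports"]

clf = NgramNaiveBayes(n=2).fit(train_docs, train_labels)
print(clf.predict("股票利率"))
print(clf.predict("足球冠军"))
```

Because the features are raw character sequences, no segmentation step or word list is needed; a practical system would also need feature selection to keep the N-gram vocabulary manageable, which this sketch omits.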

This work was supported by the China Post-doctoral Science Foundation and the Natural Science Foundation of China (NSFC) under grant number 60173027.

Author information

Authors and Affiliations

State Key Lab of Software Engineering, Wuhan University, 430072, Wuhan, China
Shuigeng Zhou
School of Computer Science, Wuhan University, 430072, Wuhan, China
Jihong Guan

Editor information

Editors and Affiliations

CIC Centro de Investigacion en Computacion, IPN Instituto Politecnico Nacional, Col Zacateno, CP 07738, Mexico DF, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Zhou, S., Guan, J. (2002). Chinese Documents Classification Based on N-Grams. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2002. Lecture Notes in Computer Science, vol 2276. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45715-1_43

DOI: https://doi.org/10.1007/3-540-45715-1_43
Published: 05 February 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43219-7
Online ISBN: 978-3-540-45715-2
eBook Packages: Springer Book Archive
