Abstract
How to improve the accuracy of categorization is a big challenge in text categorization. This paper proposes a high performance prototype system for Chinese text categorization, which mainly includes feature extraction subsystem, feature selection subsystem, and reliability evaluation subsystem for classification results. The proposed prototype system employs a two-step classifying strategy. First, the features that are effective for all testing texts are used to classify texts. Then, the reliability evaluation subsystem evaluates the classification results directly according to the outputs of the classifier, and divides them into two parts: texts classified reliable or not. Only for the texts classified unreliable at the first step, go to the second step. Second, a classifier uses the features that are more subtle and powerful for those texts classified unreliable to classify the texts. The proposed prototype system is successfully implemented in a case that exploits a Naive Bayesian classifier as the classifier in the first and second steps. Experiments show that the proposed prototype system achieves a high performance.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Sebastiani, F.: Machine Learning in Automated Text Categorization. ACM Computing Surveys 34(1), 1–47 (2002)
Lewis, D.: Naive Bayes at Forty: The Independence Assumption in Information Retrieval. In: Nédellec, C., Rouveirol, C. (eds.) ECML 1998. LNCS, vol. 1398, pp. 4–15. Springer, Heidelberg (1998)
Salton, G.: Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading (1989)
Mitchell, T.M.: Machine Learning. McCraw Hill, New York (1996)
Yang, Y., Liu, X.: A Re-examination of Text Categorization Methods. In: Proceedings of SIGIR 1999, pp. 42–49 (1999)
Fan, X.: Causality Reasoning and Text Categorization, Postdoctoral Research Report of Tsinghua University, P.R. China (April 2004)
Fan, X., Sun, M., Choi, K.-s., Zhang, Q.: Classifying Chinese texts in two steps. In: Dale, R., Wong, K.-F., Su, J., Kwong, O.Y. (eds.) IJCNLP 2005. LNCS (LNAI), vol. 3651, pp. 302–313. Springer, Heidelberg (2005)
Fan, X., Sun, M.: A high performance two-class Chinese text categorization method. Chinese Journal of Computers 29(1), 124–131 (2006)
Dumais, S.T., Platt, J., Hecherman, D., Sahami, M.: Inductive Learning Algorithms and Representation for Text Categorization. In: Proceedings of CIKM 1998, Bethesda, MD, pp. 148–155 (1998)
Sahami, M., Dumais, S., Hecherman, D., Horvitz, E.A.: Bayesian Approach to Filtering Junk E-Mail. In: Learning for Text Categorization: Papers from the AAAI Workshop, pp. 55–62, Madison Wisconsin. AAAI Technical Report WS-98-05 (1998)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Fan, X. (2006). A High Performance Prototype System for Chinese Text Categorization. In: Gelbukh, A., Reyes-Garcia, C.A. (eds) MICAI 2006: Advances in Artificial Intelligence. MICAI 2006. Lecture Notes in Computer Science(), vol 4293. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11925231_97
Download citation
DOI: https://doi.org/10.1007/11925231_97
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49026-5
Online ISBN: 978-3-540-49058-6
eBook Packages: Computer ScienceComputer Science (R0)