Abstract
Automatic Chinese text classification is an important and well-known research topic in information retrieval and natural language processing. However, previous research has often ignored the problem of word segmentation and the relationships between words. This paper proposes an N-gram-based language model for Chinese text classification that takes the relationships between words into account. To address the out-of-vocabulary problem, a novel smoothing method based on logistic regression is also proposed to improve performance. The experimental results show that our approach outperforms the previous N-gram-based classification model by more than 11% in micro-average F-measure.
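As a rough illustration of how an N-gram language model can be used for classification, the sketch below trains one character-bigram model per class and assigns a document to the class whose model gives it the highest log-probability. It is not the method of this paper: the abstract does not specify the model details, so the character-level bigrams, the `BigramClassifier` class, and its method names are all assumptions made for illustration, and simple add-one smoothing stands in for the logistic-regression-based smoothing the paper proposes.

```python
import math
from collections import Counter, defaultdict


def bigrams(text):
    # Character bigrams with a start marker "^", so no word segmenter is needed.
    padded = "^" + text
    return [(padded[i], padded[i + 1]) for i in range(len(padded) - 1)]


class BigramClassifier:
    """Hypothetical per-class bigram language model classifier (illustrative only)."""

    def __init__(self):
        self.pair_counts = defaultdict(Counter)     # label -> count of (prev, cur)
        self.context_counts = defaultdict(Counter)  # label -> count of prev
        self.doc_counts = Counter()                 # label -> number of documents
        self.vocab = set()                          # all characters seen in training

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            self.doc_counts[label] += 1
            for prev, cur in bigrams(text):
                self.pair_counts[label][(prev, cur)] += 1
                self.context_counts[label][prev] += 1
                self.vocab.add(cur)
        return self

    def log_score(self, text, label):
        # Add-one smoothing (an assumption, NOT the paper's smoothing method):
        # unseen bigrams and out-of-vocabulary characters get a small
        # non-zero probability instead of zero.
        v = len(self.vocab) + 1  # one extra slot for any unseen character
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)  # class prior
        for prev, cur in bigrams(text):
            num = self.pair_counts[label][(prev, cur)] + 1
            den = self.context_counts[label][prev] + v
            score += math.log(num / den)
        return score

    def predict(self, text):
        # Choose the class whose language model scores the text highest.
        return max(self.doc_counts, key=lambda label: self.log_score(text, label))
```

A model of this kind would be trained with `BigramClassifier().fit(docs, labels)` and queried with `predict(text)`. Because every bigram receives a non-zero probability, a document containing characters never seen in training can still be scored, which is the role smoothing plays in handling the out-of-vocabulary problem.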
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yen, SJ., Lee, YS., Wu, YC., Ying, JC., Tseng, V.S. (2010). Automatic Chinese Text Classification Using N-Gram Model. In: Taniar, D., Gervasi, O., Murgante, B., Pardede, E., Apduhan, B.O. (eds) Computational Science and Its Applications – ICCSA 2010. ICCSA 2010. Lecture Notes in Computer Science, vol 6018. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12179-1_38
DOI: https://doi.org/10.1007/978-3-642-12179-1_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12178-4
Online ISBN: 978-3-642-12179-1
eBook Packages: Computer Science, Computer Science (R0)