Automatic Chinese Text Classification Using N-Gram Model

Yen, Show-Jane; Lee, Yue-Shi; Wu, Yu-Chieh; Ying, Jia-Ching; Tseng, Vincent S.

doi:10.1007/978-3-642-12179-1_38

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6018)

Abstract

Automatic Chinese text classification is an important and well-known research topic in information retrieval and natural language processing. However, previous research has often ignored the problem of word segmentation and the relationships between words. This paper proposes an N-gram-based language model for Chinese text classification that takes the relationships between words into account. To mitigate the out-of-vocabulary problem, a novel smoothing method based on logistic regression is also proposed to improve performance. The experimental results show that our approach outperforms the previous N-gram-based classification model by more than 11% in micro-averaged F-measure.
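To make the general idea of N-gram language-model classification concrete, the following is a minimal sketch, not the authors' method. It assumes character bigrams (so no word segmenter is needed) and plain additive smoothing in place of the logistic-regression-based smoothing proposed in the paper. The class name `BigramClassifier`, the `alpha` parameter, and the toy training sentences are all illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def char_bigrams(text):
    """Return overlapping character bigrams, with '^' marking the text start."""
    padded = "^" + text
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


class BigramClassifier:
    """One character-bigram language model per class (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                          # additive smoothing constant (assumption)
        self.bigram_counts = defaultdict(Counter)   # class -> bigram -> count
        self.context_counts = defaultdict(Counter)  # class -> first char -> count
        self.doc_counts = Counter()                 # class -> number of training texts
        self.vocab = set()                          # all characters seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.doc_counts[label] += 1
            for gram in char_bigrams(text):
                self.bigram_counts[label][gram] += 1
                self.context_counts[label][gram[0]] += 1
                self.vocab.update(gram)
        return self

    def log_score(self, text, label):
        """Log prior of the class plus the smoothed bigram log-likelihood."""
        v = len(self.vocab) + 1  # one extra slot reserved for unseen characters
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        for gram in char_bigrams(text):
            num = self.bigram_counts[label][gram] + self.alpha
            den = self.context_counts[label][gram[0]] + self.alpha * v
            score += math.log(num / den)
        return score

    def predict(self, text):
        """Return the class whose language model scores the text highest."""
        return max(self.doc_counts, key=lambda c: self.log_score(text, c))


# Toy training data, invented purely for illustration.
train_texts = ["股市今日大涨", "银行下调利率", "球队赢得比赛", "球员打进一球"]
train_labels = ["finance", "finance", "sports", "sports"]

clf = BigramClassifier().fit(train_texts, train_labels)
print(clf.predict("股市利率"))
print(clf.predict("球队比赛"))
```

Additive smoothing is used here only because it is the simplest way to keep unseen bigrams from receiving zero probability; the paper's contribution is a different, learned smoothing scheme that this sketch does not attempt to reproduce.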

Author information

Authors and Affiliations

Dept. of Computer Science and Information Engineering, Ming Chuan University, 5 The-Ming Rd., Gwei Shan District, Taoyuan County, 333, Taiwan
Show-Jane Yen, Yue-Shi Lee & Yu-Chieh Wu
Dept. of Computer Science and Information Engineering, National Cheng Kung University, 1 University Rd., Tainan City, 701, Taiwan
Jia-Ching Ying & Vincent S. Tseng

Editor information

Editors and Affiliations

Clayton School of Information Technology, Monash University, 3800, Clayton, VIC, Australia
David Taniar
Department of Mathematics and Computer Science, University of Perugia, Via Vanvitelli, 1, 06123, Perugia, Italy
Osvaldo Gervasi
L.I.S.U.T. - D.A.P.I.T., University of Basilicata, Viale dell’Ateneo Lucano 10, 85100, Potenza, Italy
Beniamino Murgante
Department of Computer Science and Computer Engineering, LaTrobe University, 3086, Bundoora, VIC, Australia
Eric Pardede
Department of Intelligent Informatics, Kyushu Sangyo University, 813-8503, Fukuoka, Japan
Bernady O. Apduhan

About this paper

Cite this paper

Yen, SJ., Lee, YS., Wu, YC., Ying, JC., Tseng, V.S. (2010). Automatic Chinese Text Classification Using N-Gram Model. In: Taniar, D., Gervasi, O., Murgante, B., Pardede, E., Apduhan, B.O. (eds) Computational Science and Its Applications – ICCSA 2010. Lecture Notes in Computer Science, vol 6018. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12179-1_38

DOI: https://doi.org/10.1007/978-3-642-12179-1_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12178-4
Online ISBN: 978-3-642-12179-1
eBook Packages: Computer Science, Computer Science (R0)
