Abstract
In this paper, we perform Chinese text classification using an n-gram text representation on TanCorp, a new large corpus built specifically for Chinese text classification that contains more than 14,000 texts divided into 12 classes. We represent documents with different n-gram feature sets (1- and 2-grams, or 1-, 2- and 3-grams) and compare different feature weights (absolute text frequency, relative text frequency, absolute n-gram frequency and relative n-gram frequency). The sparseness of the “document by feature” matrices is analyzed in each case. We use the C-SVC classifier, an SVM algorithm designed for multi-class classification, and run our experiments on the TANAGRA platform. We found that the feature selection methods based on n-gram frequency (absolute or relative) always give better results and produce denser matrices.
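To make the terminology concrete, the following is a minimal Python sketch of character n-gram extraction, the four frequency statistics named above, and the sparseness of a “document by feature” matrix. It is an illustration only, not the authors' implementation (the experiments were run in TANAGRA). The function names are hypothetical, and the definitions are assumptions: text frequency is taken to be the number of documents containing an n-gram, n-gram frequency its total number of occurrences in the corpus, and the relative variants are those counts divided by the number of documents and by the total number of n-gram occurrences, respectively.

```python
from collections import Counter


def char_ngrams(text, sizes=(1, 2)):
    """Return every character n-gram of the given sizes, in order of appearance."""
    grams = []
    for n in sizes:
        # A text shorter than n yields an empty range, hence no n-grams.
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams


def feature_statistics(docs, sizes=(1, 2)):
    """Compute four statistics per n-gram (definitions are assumptions, see lead-in)."""
    text_freq = Counter()   # number of documents that contain the n-gram
    ngram_freq = Counter()  # total occurrences of the n-gram in the corpus
    for doc in docs:
        grams = char_ngrams(doc, sizes)
        ngram_freq.update(grams)
        text_freq.update(set(grams))  # count each document at most once
    n_docs = len(docs)
    total = sum(ngram_freq.values())
    return {
        g: {
            "abs_text": text_freq[g],
            "rel_text": text_freq[g] / n_docs,
            "abs_ngram": ngram_freq[g],
            "rel_ngram": ngram_freq[g] / total,
        }
        for g in ngram_freq
    }


def sparseness(docs, features, sizes=(1, 2)):
    """Fraction of zero cells in the document-by-feature matrix."""
    zero_cells = 0
    for doc in docs:
        present = set(char_ngrams(doc, sizes))
        zero_cells += sum(1 for f in features if f not in present)
    return zero_cells / (len(docs) * len(features))


if __name__ == "__main__":
    docs = ["中文文本", "文本分类"]  # toy corpus, not taken from TanCorp
    stats = feature_statistics(docs)
    print(stats["文本"])
    print(sparseness(docs, ["文", "中", "分类"]))
```

Working on characters rather than words means no dictionary or word segmentation step is needed, which is the usual motivation for character n-grams in Chinese text.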
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wei, Z., Miao, D., Chauchat, J.-H., Zhong, C. (2008). Feature Selection on Chinese Text Classification Using Character N-Grams. In: Wang, G., Li, T., Grzymala-Busse, J.W., Miao, D., Skowron, A., Yao, Y. (eds) Rough Sets and Knowledge Technology. RSKT 2008. Lecture Notes in Computer Science, vol 5009. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-79721-0_68
DOI: https://doi.org/10.1007/978-3-540-79721-0_68
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-79720-3
Online ISBN: 978-3-540-79721-0
eBook Packages: Computer Science, Computer Science (R0)