Abstract
Text Categorization is the process of automatically assigning predefined categories to free text documents. Feature weight, which calculates feature (term) values in documents, is one of important preprocessing techniques in text categorization. This paper is a comparative study of feature weight methods in statistical learning of text categorization. Four methods were evaluated, including tf*idf, tf*CRF, tf*OddsRatio, and tf*CHI. We have evaluated these methods on benchmark collection Reuters-21578 with Support Vector Machines (SVMs) classifiers. We found that tf*CHI is most effective in our experiments. Using tf*CHI with a SVMs classifier yielded a very high classification accuracy (87.5% for micro-average F 1 and 87.8% for micro-average break-even point). tf*idf, which is widely used in text categorization, compares favorably with tf *CRF but is not as effective as tf*CHI and tf*OddsRatio.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Yang, Y.: Expert network: Effective and efficient learning from human decisions in text categorization and retrieval. In: 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1994), pp. 13-22 (1994)
McCallum, A., Nigam, K.: A comparison of event models for naïve bayes text classification. In: AAA 1998 Workshop on Learning for Text Categorization (1998)
Apte, C., Damerau, F., Weiss, S.: Text mining with decision rules and decision trees. In: Proceedings of Conference on Automated Learning and Discovery, Workshop 6: Learning from Text and the Web (1998)
Ng, H.T., Goh, W.B., Low, K.L.: Feature selection, perceptron learning, and a usability case study for text categorization. In: 20th Ann Int ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1997), pp. 67-73 (1997)
Schapire, R.E., Singer, Y.: BoosTexter: A Boosting-based System for Text Categorization. Machine Learning 39(2/3), 135–168 (2000)
Joachims, T.: Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In: Nédellec, C., Rouveirol, C. (eds.) ECML 1998. LNCS, vol. 1398, pp. 137–142. Springer, Heidelberg (1998)
Salton, G., Lesk, M.E.: Computer evaluation of indexing and text processing. Journal of the ACM 15(1), 8–36 (1968)
Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Information Processing and Management 24(5), 513–523 (1988)
Deng, Z.H., Tang, S.W., Yang, D.Q., Zhang, M., Wu, X.B., Yang, M.: A Linear Text Classification Algorithm Based on Category Relevance Factors. In: Lim, E.-p., Foo, S.S.-B., Khoo, C., Chen, H., Fox, E., Urs, S.R., Costantino, T. (eds.) ICADL 2002. LNCS, vol. 2555, pp. 88–98. Springer, Heidelberg (2002)
Mladenic, D., Grobelnik, M.: Feature Selection for Classification Based on Text Hierarchy. In: Working notes of Learning from Text and the Web, Conference on Automated Learning and Discovery (CONALD 1998) (1998)
Yang, Y., Pedersen, J.P.: A Comparative Study on Feature Selection in Text Categorization. In: Proceedings of 14th International Conference on Machine Learning, pp. 412-420 (1997)
Vapnic, V.: The Nature of Statistical Learning Theory. Springer, New York (1995)
Cortes, C., Vapnik, V.: Support Vector networks. Machine Learning 20, 273–297 (1995)
Osuna, Freund, R., Girosi, F.: Support vector machines: Training and applications. In: A.I. Memo. MIT A.I. Lab (1996)
Dumais, S., Platt, J., Heckerman, D., Sahami, M.: Inductive learning algorithms and representations for text categorization. In: Proceedings of the 1998 ACM CIKM International Conference on Information and Knowledge Management, pp. 148–155 (1998)
Yang, Y., Liu, X.: A re-examination of text categorization methods. In: 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1999), pp. 42-49 (1999)
Cooley, R.: Classification of News Stories Using Support Vector Machines. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence Text Mining Workshop (1999)
Bekkerman, R., Ran, E.Y., Tishby, N., Winter, Y.: On feature distributional clustering for text categorization. In: Proceedings of the 24th ACM SIGIR International Conference on Research and Development in Information Retrieval, pp. 146-153 (2001)
Joachims, T.: Making large-Scale SVM Learning Practical. In: Advances in Kernel Methods - Support Vector Learning. MIT-Press, Cambridge (1999)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Deng, ZH., Tang, SW., Yang, DQ., Li, M.Z.LY., Xie, KQ. (2004). A Comparative Study on Feature Weight in Text Categorization. In: Yu, J.X., Lin, X., Lu, H., Zhang, Y. (eds) Advanced Web Technologies and Applications. APWeb 2004. Lecture Notes in Computer Science, vol 3007. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24655-8_64
Download citation
DOI: https://doi.org/10.1007/978-3-540-24655-8_64
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21371-0
Online ISBN: 978-3-540-24655-8
eBook Packages: Springer Book Archive