A Comparative Study on Feature Weight in Text Categorization

Deng, Zhi-Hong; Tang, Shi-Wei; Yang, Dong-Qing; Li, Ming Zhang Li-Yu; Xie, Kun-Qing

doi:10.1007/978-3-540-24655-8_64

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3007)

Included in the following conference series:

Asia-Pacific Web Conference

Abstract

Text categorization is the process of automatically assigning predefined categories to free-text documents. Feature weight, which computes the value of each feature (term) in a document, is one of the important preprocessing techniques in text categorization. This paper is a comparative study of feature weight methods in statistical learning for text categorization. Four methods were evaluated: tf*idf, tf*CRF, tf*OddsRatio, and tf*CHI. We evaluated these methods on the benchmark collection Reuters-21578 with Support Vector Machine (SVM) classifiers. We found tf*CHI to be the most effective in our experiments. Using tf*CHI with an SVM classifier yielded a very high classification accuracy (87.5% for micro-average F₁ and 87.8% for micro-average break-even point). tf*idf, which is widely used in text categorization, compares favorably with tf*CRF but is not as effective as tf*CHI and tf*OddsRatio.
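
To make the difference between the weighting schemes concrete, here is a minimal Python sketch that contrasts tf*idf with tf*CHI on a toy corpus. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact formulas, so the sketch assumes the common idf form log(N/df) and the standard chi-square statistic of a 2x2 term/category contingency table, scored against a single category. The corpus, the category names, and the helper names (idf, chi_square, term_scores, weight_document) are all hypothetical.

```python
import math
from collections import Counter

# Toy labeled corpus (hypothetical data, not Reuters-21578).
DOCS = [
    (["wheat", "grain", "export"], "grain"),
    (["wheat", "price", "rise"], "grain"),
    (["oil", "price", "rise"], "crude"),
    (["oil", "export", "barrel"], "crude"),
]


def idf(n_docs, doc_freq):
    """Inverse document frequency, assumed here to be log(N / df)."""
    return math.log(n_docs / doc_freq)


def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 term/category table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def term_scores(docs, category):
    """Return per-term idf and chi-square scores for one category."""
    n = len(docs)
    vocab = {t for tokens, _ in docs for t in tokens}
    idf_scores, chi_scores = {}, {}
    for term in vocab:
        a = sum(1 for toks, lab in docs if term in toks and lab == category)
        b = sum(1 for toks, lab in docs if term in toks and lab != category)
        c = sum(1 for toks, lab in docs if term not in toks and lab == category)
        d = n - a - b - c
        idf_scores[term] = idf(n, a + b)
        chi_scores[term] = chi_square(a, b, c, d)
    return idf_scores, chi_scores


def weight_document(tokens, term_score):
    """Weight each term by raw term frequency times its per-term score."""
    tf = Counter(tokens)
    return {t: tf[t] * term_score.get(t, 0.0) for t in tf}


idf_scores, chi_scores = term_scores(DOCS, "grain")
doc = ["wheat", "price", "rise"]
tfidf = weight_document(doc, idf_scores)
tfchi = weight_document(doc, chi_scores)

for term in doc:
    print(f"{term:6s} tf*idf={tfidf[term]:.3f}  tf*CHI={tfchi[term]:.3f}")
```

In this toy corpus "wheat" and "price" each occur in two of the four documents, so tf*idf assigns them the same weight. tf*CHI uses the category labels: "wheat" occurs only in the "grain" documents and receives a high weight, while "price" is split evenly across both categories and receives zero. This label awareness is one plausible reading of why a supervised weight could outperform tf*idf, although the abstract itself does not state the reason.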

Author information

Authors and Affiliations

School of Electronics Engineering and Computer Science, Peking University, Beijing, 100871
Zhi-Hong Deng, Shi-Wei Tang, Dong-Qing Yang, Ming Zhang Li-Yu Li & Kun-Qing Xie

Editor information

Editors and Affiliations

Chinese University of Hong Kong, Hong Kong, China
Jeffrey Xu Yu
The University of New South Wales, NSW 2052, Australia
Xuemin Lin
Department of Computer Science, Tsinghua University, 100084, Beijing, P.R. China
Hongjun Lu
Victoria University, Australia
Yanchun Zhang

About this paper

Cite this paper

Deng, ZH., Tang, SW., Yang, DQ., Li, M.Z.LY., Xie, KQ. (2004). A Comparative Study on Feature Weight in Text Categorization. In: Yu, J.X., Lin, X., Lu, H., Zhang, Y. (eds) Advanced Web Technologies and Applications. APWeb 2004. Lecture Notes in Computer Science, vol 3007. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24655-8_64

DOI: https://doi.org/10.1007/978-3-540-24655-8_64
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21371-0
Online ISBN: 978-3-540-24655-8
eBook Packages: Springer Book Archive
