A Comparative Study on Feature Weight in Text Categorization

  • Conference paper
Advanced Web Technologies and Applications (APWeb 2004)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3007)

Abstract

Text categorization is the process of automatically assigning predefined categories to free-text documents. Feature weight, which calculates feature (term) values in documents, is one of the important preprocessing techniques in text categorization. This paper is a comparative study of feature weight methods in statistical learning of text categorization. Four methods were evaluated: tf*idf, tf*CRF, tf*OddsRatio, and tf*CHI. We evaluated these methods on the benchmark collection Reuters-21578 with Support Vector Machine (SVM) classifiers. We found that tf*CHI is the most effective in our experiments. Using tf*CHI with an SVM classifier yielded a very high classification accuracy (87.5% for micro-average F1 and 87.8% for micro-average break-even point). tf*idf, which is widely used in text categorization, compares favorably with tf*CRF but is not as effective as tf*CHI and tf*OddsRatio.
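To make the idea of a feature weight concrete, the following is a minimal Python sketch of two of the four schemes named in the abstract, tf*idf and tf*CHI. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the exact formulas, so the sketch assumes raw term frequency, idf = log(N / df), and the standard chi-square statistic computed from a 2x2 term/category contingency table, scored against a single category. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def idf(term, docs):
    """Inverse document frequency, log(N / df). Assumes the term occurs in some doc."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def chi_square(term, category, docs, labels):
    """Chi-square statistic of a (term, category) pair from a 2x2 contingency table."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        has_term = term in doc
        in_cat = label == category
        if has_term and in_cat:
            a += 1   # term present, document in category
        elif has_term:
            b += 1   # term present, document outside category
        elif in_cat:
            c += 1   # term absent, document in category
        else:
            d += 1   # term absent, document outside category
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom

def weight_document(doc, score):
    """Weight each term of a tokenized document as tf(term) * score(term)."""
    tf = Counter(doc)
    return {term: count * score(term) for term, count in tf.items()}

# Hypothetical toy corpus: tokenized documents with one label each.
docs = [
    ["wheat", "price", "wheat"],
    ["wheat", "export"],
    ["stock", "price"],
    ["stock", "market"],
]
labels = ["grain", "grain", "finance", "finance"]

# tf*idf ignores the labels; tf*CHI uses them (here, relative to "grain").
tfidf = weight_document(docs[0], lambda t: idf(t, docs))
tfchi = weight_document(docs[0], lambda t: chi_square(t, "grain", docs, labels))
print(tfidf)
print(tfchi)
```

In this toy corpus "wheat" and "price" each occur in two of the four documents, so idf cannot tell them apart, while the chi-square score separates them because "wheat" occurs only in the "grain" documents and "price" is spread evenly across both categories. How the per-category scores are combined for a multi-category collection such as Reuters-21578 is left open here, since the abstract does not say.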

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Deng, ZH., Tang, SW., Yang, DQ., Li, M.Z.LY., Xie, KQ. (2004). A Comparative Study on Feature Weight in Text Categorization. In: Yu, J.X., Lin, X., Lu, H., Zhang, Y. (eds) Advanced Web Technologies and Applications. APWeb 2004. Lecture Notes in Computer Science, vol 3007. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24655-8_64

  • DOI: https://doi.org/10.1007/978-3-540-24655-8_64

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-21371-0

  • Online ISBN: 978-3-540-24655-8

  • eBook Packages: Springer Book Archive
