Best terms: an efficient feature-selection algorithm for text categorization

Knowledge and Information Systems

Abstract

In this paper, we propose a new feature-selection algorithm for text classification, called best terms (BT). The complexity of BT is linear with respect to the number of training-set documents and is independent of both the vocabulary size and the number of categories. We evaluate BT on two benchmark document collections, Reuters-21578 and 20-Newsgroups, using two classification algorithms, naive Bayes (NB) and support vector machines (SVM). Our experimental results, which compare BT with an extensive and representative list of feature-selection algorithms, show that (1) BT is faster than the existing feature-selection algorithms; (2) BT leads to a considerable increase in the classification accuracy of NB and SVM, as measured by the F1 measure; (3) BT leads to a considerable improvement in the speed of NB and SVM, and in most cases the training time of SVM drops by an order of magnitude; (4) in most cases, the combination of BT with the simple but very fast NB algorithm leads to classification accuracy comparable with that of SVM, and sometimes it is even more accurate.
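
The abstract reports accuracy with the F1 measure, the harmonic mean of precision and recall. The sketch below shows how that figure can be computed from per-category counts. It is an illustration only, not the authors' evaluation code: the function names, the example counts, and the choice to show both micro- and macro-averaging are assumptions, since the abstract does not say which averaging was used.

```python
from typing import Iterable, Tuple

# Each category is summarised as (true positives, false positives, false negatives).
Counts = Tuple[int, int, int]


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 for one category: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_f1(counts: Iterable[Counts]) -> float:
    """Pool the counts of all categories, then compute a single F1."""
    counts = list(counts)
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1_score(tp, fp, fn)


def macro_f1(counts: Iterable[Counts]) -> float:
    """Compute F1 per category, then take the unweighted mean."""
    counts = list(counts)
    return sum(f1_score(*c) for c in counts) / len(counts)


# Hypothetical counts for two categories (not taken from the paper).
example = [(8, 2, 2), (1, 1, 3)]
print(f"micro-F1 = {micro_f1(example):.4f}")
print(f"macro-F1 = {macro_f1(example):.4f}")
```

Micro-averaging weights every document decision equally, so frequent categories dominate the score, while macro-averaging weights every category equally. On skewed collections such as Reuters-21578 the two can differ noticeably.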

Author information

Corresponding author

Correspondence to Dimitris Fragoudis.

About this article

Cite this article

Fragoudis, D., Meretakis, D. & Likothanassis, S. Best terms: an efficient feature-selection algorithm for text categorization. Knowl Inf Syst 8, 16–33 (2005). https://doi.org/10.1007/s10115-004-0177-2

  • DOI: https://doi.org/10.1007/s10115-004-0177-2
