Best terms: an efficient feature-selection algorithm for text categorization

Fragoudis, Dimitris; Meretakis, Dimitris; Likothanassis, Spiridon

doi:10.1007/s10115-004-0177-2

Published: 01 July 2005

Volume 8, pages 16–33, (2005)

Knowledge and Information Systems

Dimitris Fragoudis¹,
Dimitris Meretakis² &
Spiridon Likothanassis¹,³

Abstract

In this paper, we propose a new feature-selection algorithm for text classification, called best terms (BT). The complexity of BT is linear with respect to the number of training-set documents and is independent of both the vocabulary size and the number of categories. We evaluate BT on two benchmark document collections, Reuters-21578 and 20-Newsgroups, using two classification algorithms, naive Bayes (NB) and support vector machines (SVM). Our experimental results, comparing BT with an extensive and representative list of feature-selection algorithms, show that (1) BT is faster than the existing feature-selection algorithms; (2) BT leads to a considerable increase in the classification accuracy of NB and SVM as measured by the F1 measure; (3) BT leads to a considerable improvement in the speed of NB and SVM; in most cases, the training time of SVM drops by an order of magnitude; (4) in most cases, the combination of BT with the simple but very fast NB algorithm leads to classification accuracy comparable with that of SVM, and sometimes it is even more accurate.
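
For readers unfamiliar with the F1 measure mentioned above, the following is a minimal sketch of its standard per-category definition as the harmonic mean of precision and recall. The function name `f1_score` and the example counts are illustrative assumptions and are not taken from the paper; the abstract does not state how per-category scores are averaged, so no averaging is shown.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1 for one category, from true-positive, false-positive
    and false-negative counts (hypothetical inputs, not the paper's data)."""
    if tp == 0:
        # With no true positives, precision and recall are zero or undefined;
        # returning 0.0 is a common convention, assumed here.
        return 0.0
    precision = tp / (tp + fp)  # fraction of assigned labels that are correct
    recall = tp / (tp + fn)     # fraction of true labels that are recovered
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Illustrative counts only: 6 correct assignments, 2 spurious, 6 missed.
    print(round(f1_score(6, 2, 6), 3))
```

Because F1 is a harmonic mean, it stays low unless precision and recall are both high, which is why it is commonly preferred over plain accuracy when categories are unbalanced.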

Author information

Authors and Affiliations

Computer Engineering and Informatics Department, University of Patras, Rio—Patras, GR-26500, Greece
Dimitris Fragoudis & Spiridon Likothanassis
Novartis Pharma, Griffith University, Basel, Switzerland
Dimitris Meretakis
Computer Technology Institute, Patras, Greece
Spiridon Likothanassis

Corresponding author

Correspondence to Dimitris Fragoudis.

About this article

Cite this article

Fragoudis, D., Meretakis, D. & Likothanassis, S. Best terms: an efficient feature-selection algorithm for text categorization. Knowl Inf Syst 8, 16–33 (2005). https://doi.org/10.1007/s10115-004-0177-2

Received: 21 September 2003
Revised: 30 January 2004
Accepted: 05 March 2004
Published: 01 July 2005
Issue Date: July 2005
DOI: https://doi.org/10.1007/s10115-004-0177-2
