Feature selection for text categorization on imbalanced data

Published: 01 June 2004

Abstract

A number of feature selection metrics have been explored in text categorization, among which information gain (IG), chi-square (CHI), correlation coefficient (CC), and odds ratio (OR) are considered most effective. CC and OR are one-sided metrics, while IG and CHI are two-sided. Feature selection with a one-sided metric selects only the features most indicative of class membership (positive features), whereas feature selection with a two-sided metric implicitly combines the features most indicative of membership (positive features) and non-membership (negative features) by ignoring the signs of the scores. The former never considers negative features, which are quite valuable; the latter cannot guarantee an optimal combination of the two kinds of features, especially on imbalanced data. In this work, we investigate the usefulness of explicitly controlling that combination within a proposed feature selection framework. Using multinomial naïve Bayes and regularized logistic regression as classifiers, our experiments show both the great potential and the actual merit of explicitly combining positive and negative features in a nearly optimal proportion chosen according to the class imbalance.
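To make the distinction concrete, here is a minimal sketch (Python with NumPy) of how a one-sided score and an explicit positive/negative combination might look. The function names, the use of CC as the signed square root of chi-square, and the split sizes l1 and l2 are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: score terms with the correlation coefficient (CC), the signed
# square root of chi-square, then explicitly keep the l1 strongest positive
# and l2 strongest negative features. Names and parameters are illustrative.
import numpy as np

def correlation_coefficient(X, y):
    """CC score per term for a binary presence matrix X (docs x terms, 0/1)
    and binary labels y (1 = category member)."""
    N = float(len(y))
    pos = (y == 1)
    A = X[pos].sum(axis=0).astype(float)    # term present, in category
    B = X[~pos].sum(axis=0).astype(float)   # term present, not in category
    C = pos.sum() - A                       # term absent, in category
    D = (~pos).sum() - B                    # term absent, not in category
    num = np.sqrt(N) * (A * D - B * C)
    den = np.sqrt((A + B) * (C + D) * (A + C) * (B + D))
    # Guard against zero denominators for degenerate terms.
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

def select_combined(X, y, l1, l2):
    """Explicitly pick the l1 most positive and l2 most negative terms by CC,
    rather than the top |CC| terms (two-sided) or positive terms only."""
    cc = correlation_coefficient(X, y)
    order = np.argsort(cc)                  # ascending: most negative first
    return np.concatenate([order[::-1][:l1],  # strongest positive features
                           order[:l2]])       # strongest negative features
```

By contrast, a two-sided selector would rank by the absolute score, e.g. np.argsort(np.abs(cc)), letting the metric silently fix the positive-to-negative mix; the point of the framework is to expose that mix as an l1 : l2 ratio that can be tuned per category as the class distribution grows more skewed.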


    • Published in

      ACM SIGKDD Explorations Newsletter, Volume 6, Issue 1
      Special issue on learning from imbalanced datasets
      June 2004, 117 pages
      ISSN: 1931-0145
      EISSN: 1931-0153
      DOI: 10.1145/1007730

      Copyright © 2004 Authors

      Publisher

      Association for Computing Machinery

      New York, NY, United States

