Feature selection for text categorization on imbalanced data

Published: 01 June 2004

Abstract

A number of feature selection metrics have been explored in text categorization, among which information gain (IG), chi-square (CHI), correlation coefficient (CC), and odds ratio (OR) are considered most effective. CC and OR are one-sided metrics, while IG and CHI are two-sided. Feature selection with a one-sided metric selects only the features most indicative of class membership (positive features), whereas feature selection with a two-sided metric implicitly combines the features most indicative of membership (positive features) and non-membership (negative features) by ignoring the signs of the scores. The former never considers negative features, which are quite valuable; the latter cannot guarantee an optimal combination of the two kinds of features, especially on imbalanced data. In this work, we investigate the usefulness of explicitly controlling that combination within a proposed feature selection framework. Using multinomial naïve Bayes and regularized logistic regression as classifiers, our experiments show both the great potential and the actual merit of explicitly combining positive and negative features in a nearly optimal proportion chosen according to the class imbalance.
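To make the distinction concrete, here is a minimal sketch (Python with NumPy) of how a one-sided score and an explicit positive/negative combination might look. The function names, the use of CC as the signed square root of chi-square, and the split sizes l1 and l2 are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: score terms with the correlation coefficient (CC), the signed
# square root of chi-square, then explicitly keep the l1 strongest positive
# and l2 strongest negative features. Names and parameters are illustrative.
import numpy as np

def correlation_coefficient(X, y):
    """CC score per term for a binary presence matrix X (docs x terms, 0/1)
    and binary labels y (1 = category member)."""
    N = float(len(y))
    pos = (y == 1)
    A = X[pos].sum(axis=0).astype(float)    # term present, in category
    B = X[~pos].sum(axis=0).astype(float)   # term present, not in category
    C = pos.sum() - A                       # term absent, in category
    D = (~pos).sum() - B                    # term absent, not in category
    num = np.sqrt(N) * (A * D - B * C)
    den = np.sqrt((A + B) * (C + D) * (A + C) * (B + D))
    # Guard against zero denominators for degenerate terms.
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

def select_combined(X, y, l1, l2):
    """Explicitly pick the l1 most positive and l2 most negative terms by CC,
    rather than the top |CC| terms (two-sided) or positive terms only."""
    cc = correlation_coefficient(X, y)
    order = np.argsort(cc)                  # ascending: most negative first
    return np.concatenate([order[::-1][:l1],  # strongest positive features
                           order[:l2]])       # strongest negative features
```

By contrast, a two-sided selector would rank by the absolute score, e.g. np.argsort(np.abs(cc)), letting the metric silently fix the positive-to-negative mix; the point of the framework is to expose that mix as an l1 : l2 ratio that can be tuned per category as the class distribution grows more skewed.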


    • Published in

      ACM SIGKDD Explorations Newsletter, Volume 6, Issue 1
      Special issue on learning from imbalanced datasets
      June 2004, 117 pages
      ISSN: 1931-0145
      EISSN: 1931-0153
      DOI: 10.1145/1007730

      Copyright © 2004 Authors

      Publisher

      Association for Computing Machinery

      New York, NY, United States

