Article

Local decomposition for rare class analysis

Authors:
Junjie Wu (Tsinghua University, Beijing, China),
Hui Xiong (Rutgers University),
Peng Wu (Tsinghua University, Beijing, China),
Jian Chen (Tsinghua University, Beijing, China)

KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 2007, Pages 814–823, https://doi.org/10.1145/1281192.1281279

Published: 12 August 2007

ABSTRACT

Given its importance, the problem of predicting rare classes in large-scale multi-labeled data sets has attracted great attention in the literature. However, the rare-class problem remains a critical challenge, because no natural way has been developed for handling imbalanced class distributions. This paper fills this void by developing a method for Classification using lOcal clusterinG (COG). Specifically, for a data set with an imbalanced class distribution, we perform clustering within each large class and produce sub-classes with relatively balanced sizes. Then, we apply traditional supervised learning algorithms, such as Support Vector Machines (SVMs), for classification. Our experimental results on various real-world data sets show that our method produces significantly higher prediction accuracies on rare classes than state-of-the-art methods. Furthermore, we show that COG can also improve the performance of traditional supervised learning algorithms on data sets with balanced class distributions.
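
The two steps described in the abstract (cluster within each large class, then train an ordinary classifier on the resulting sub-classes) can be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: the helper names (`kmeans`, `cog_relabel`, `fit_centroids`, `predict`), the explicit `large_classes` argument, the farthest-first initialization, and the toy data are all assumptions made for the example. To keep the sketch dependency-free, a nearest-centroid classifier stands in for the SVM mentioned in the abstract.

```python
from collections import defaultdict

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def kmeans(points, k, iters=10):
    """Lloyd's k-means with deterministic farthest-first initialization."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: index of the nearest center for every point.
        assign = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster goes empty
                centers[c] = mean(members)
    return assign

def cog_relabel(X, y, large_classes, k):
    """Split each large class into k sub-classes labeled (class, cluster_id)."""
    sub = [(label, 0) for label in y]  # small classes stay as one sub-class
    for cls in large_classes:
        idx = [i for i, label in enumerate(y) if label == cls]
        clusters = kmeans([X[i] for i in idx], k)
        for i, c in zip(idx, clusters):
            sub[i] = (cls, c)
    return sub

def fit_centroids(X, labels):
    """Stand-in classifier (NOT an SVM): one centroid per sub-class."""
    groups = defaultdict(list)
    for x, label in zip(X, labels):
        groups[label].append(x)
    return {label: mean(pts) for label, pts in groups.items()}

def predict(centroids, x):
    """Pick the nearest sub-class, then map it back to its original class."""
    sub = min(centroids, key=lambda label: sq_dist(x, centroids[label]))
    return sub[0]

# Hypothetical toy data: a large "normal" class made of two blobs, with a
# small "rare" class sitting between them.
grid = [(i * 0.5, j * 0.5) for i in range(-2, 3) for j in range(-2, 3)]
blob_a = grid                                # 25 points around (0, 0)
blob_b = [(10 + px, py) for px, py in grid]  # 25 points around (10, 0)
rare = [(5.0, 0.0), (4.8, 0.0), (5.2, 0.0), (5.0, 0.2), (5.0, -0.2)]
X = blob_a + blob_b + rare
y = ["normal"] * 50 + ["rare"] * 5

sub_labels = cog_relabel(X, y, large_classes=["normal"], k=2)
model = fit_centroids(X, sub_labels)
print([predict(model, p) for p in [(0.3, 0.0), (5.0, 0.1), (9.0, 0.5)]])
# prints ['normal', 'rare', 'normal']
```

In this toy setup a single centroid for the whole "normal" class would land at (5, 0), directly on top of the rare class; splitting "normal" into two sub-classes first gives the downstream classifier three reasonably distinct groups of sizes 25, 25, and 5. How many clusters to use per class, and which classes count as large, are choices this sketch leaves to the caller.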

Index Terms

Local decomposition for rare class analysis
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
    2. Machine learning approaches
      1. Classification and regression trees
2. Information systems
  1. Information systems applications
    1. Data mining

Published in
KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2007
1080 pages
ISBN: 9781595936097
DOI: 10.1145/1281192
General Chair: Pavel Berkhin (Yahoo!, USA)
Program Chairs: Rich Caruana (Cornell University, USA), Xindong Wu (University of Vermont, USA)
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
k-means clustering
support vector machines
local clustering
rare class analysis
Qualifiers
- Article
Conference

Acceptance Rates
KDD '07 Paper Acceptance Rate: 111 of 573 submissions, 19%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%