ABSTRACT
Predicting rare classes in large-scale multi-labeled data sets is an important problem that has attracted great attention in the literature. However, rare-class analysis remains a critical challenge, because no natural way of handling imbalanced class distributions has been developed. This paper fills this void by developing a method for Classification using lOcal clusterinG (COG). Specifically, for a data set with an imbalanced class distribution, we perform clustering within each large class and produce sub-classes with relatively balanced sizes. Then, we apply traditional supervised learning algorithms, such as Support Vector Machines (SVMs), for classification. Our experimental results on various real-world data sets show that our method produces significantly higher prediction accuracies on rare classes than state-of-the-art methods. Furthermore, we show that COG can also improve the performance of traditional supervised learning algorithms on data sets with balanced class distributions.
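The two steps described above (cluster within each large class, then train a standard classifier on the resulting sub-classes) can be illustrated with a minimal sketch. This is not the authors' implementation: a plain k-means stands in for whichever clustering method the paper uses, a nearest-centroid classifier stands in for the SVM so the example needs only the standard library, and the function names, the `large_classes` argument, `k=2`, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict
from math import dist


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with deterministic, evenly spaced initial centroids."""
    centroids = [points[i * len(points) // k] for i in range(k)]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = mean(members)
    return assign


def cog_relabel(X, y, large_classes, k):
    """Split each large class into k sub-classes; other classes keep one sub-class."""
    sub = [(lab, 0) for lab in y]
    for lab in large_classes:
        idx = [i for i, t in enumerate(y) if t == lab]
        for i, c in zip(idx, kmeans([X[i] for i in idx], k)):
            sub[i] = (lab, c)
    return sub


def fit_centroids(X, labels):
    """Stand-in base learner (assumption): one centroid per label."""
    groups = defaultdict(list)
    for x, lab in zip(X, labels):
        groups[lab].append(x)
    return {lab: mean(pts) for lab, pts in groups.items()}


def predict(model, x):
    """Return the label whose centroid is closest to x."""
    return min(model, key=lambda lab: dist(x, model[lab]))


def cog_predict(model, x):
    """Predict a sub-class, then map it back to its original class."""
    return predict(model, x)[0]


def accuracy(predict_fn, X, y):
    return sum(predict_fn(x) == t for x, t in zip(X, y)) / len(y)


# Toy data (hypothetical): a large class made of two blobs with a rare class between them.
offsets = [i / 10 for i in range(-5, 5)]
large = [(-5 + d, 0.0) for d in offsets] + [(5 + d, 0.0) for d in offsets]
rare = [(-0.3, 0.0), (0.0, 0.0), (0.3, 0.0)]
X = large + rare
y = ["large"] * len(large) + ["rare"] * len(rare)

base_model = fit_centroids(X, y)  # no local clustering
cog_model = fit_centroids(X, cog_relabel(X, y, large_classes=["large"], k=2))

base_acc = accuracy(lambda x: predict(base_model, x), X, y)
cog_acc = accuracy(lambda x: cog_predict(cog_model, x), X, y)
print("baseline accuracy:", base_acc)
print("with local clustering:", cog_acc)
```

On this toy data the single centroid of the large class falls almost on top of the rare class, so the baseline confuses the two. Splitting the large class into two sub-classes gives each blob its own centroid, which is the effect the local clustering step is meant to achieve. With an SVM as the base learner the same relabel, train, and map-back structure would apply.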