DOI: 10.1145/1281192.1281279
KDD Conference Proceedings · Article

Local decomposition for rare class analysis

Published: 12 August 2007

ABSTRACT

Given its importance, the problem of predicting rare classes in large-scale multi-labeled data sets has attracted great attention in the literature. Nevertheless, the rare-class problem remains a critical challenge, because there is no natural way to handle imbalanced class distributions. This paper fills this void by developing a method for Classification using lOcal clusterinG (COG). Specifically, for a data set with an imbalanced class distribution, we perform clustering within each large class and produce sub-classes of relatively balanced size. We then apply traditional supervised learning algorithms, such as Support Vector Machines (SVMs), for classification. Our experimental results on various real-world data sets show that COG produces significantly higher prediction accuracy on rare classes than state-of-the-art methods. Furthermore, we show that COG can also improve the performance of traditional supervised learning algorithms on data sets with balanced class distributions.
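The decomposition idea described in the abstract can be sketched in a few lines. The following is a minimal, hypothetical illustration, not the authors' implementation: the paper uses CLUTO-style clustering and SVMs, whereas here plain k-means and a nearest-centroid classifier stand in, and all function names (`cog_fit`, `cog_predict`) are our own. The key step is that each large class is split into clustered sub-classes of roughly balanced size, the classifier is trained on the sub-class labels, and predictions on those sub-classes are mapped back to the original class.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means; returns (sub-class label per point, centers)."""
    centers = random.Random(seed).sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: dist2(p, centers[c]))
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if the cluster emptied out
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centers

def cog_fit(X, y, large_class=0, k=2):
    """Decompose the large class into k clustered sub-classes and store one
    centroid per (sub-)class for nearest-centroid prediction."""
    big = [x for x, lab in zip(X, y) if lab == large_class]
    _, sub_centers = kmeans(big, k)
    model = {("sub", c): ctr for c, ctr in enumerate(sub_centers)}
    for lab in set(y):
        if lab == large_class:
            continue
        members = [x for x, l in zip(X, y) if l == lab]
        model[("class", lab)] = tuple(sum(v) / len(members) for v in zip(*members))
    return model

def cog_predict(model, x, large_class=0):
    """Predict by nearest centroid; large-class sub-classes map back to it."""
    kind, tag = min(model, key=lambda key: dist2(x, model[key]))
    return large_class if kind == "sub" else tag

# Toy imbalanced data: a large class 0 spread over two regions, a rare class 1.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 0), (10, 1), (11, 0), (11, 1),
     (5, 5), (5, 6), (6, 5), (6, 6)]
y = [0] * 8 + [1] * 4
model = cog_fit(X, y, large_class=0, k=2)
print(cog_predict(model, (0.5, 0.5)))  # → 0 (large class)
print(cog_predict(model, (5.5, 5.5)))  # → 1 (rare class)
```

Splitting the 8-point large class into two sub-classes leaves each sub-class comparable in size to the 4-point rare class, which is the balancing effect the method relies on.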

Supplemental Material

  * p814-wu-200.mov (38 MB)
  * p814-wu-768.mov (129 MB)


Published in

KDD '07: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2007 · 1080 pages
ISBN: 9781595936097
DOI: 10.1145/1281192
Copyright © 2007 ACM
Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

KDD '07 paper acceptance rate: 111 of 573 submissions (19%). Overall acceptance rate: 1,133 of 8,635 submissions (13%).
