
Mining with rarity: a unifying framework

Published: 01 June 2004

Abstract

Rare objects are often of great interest and great value. Until recently, however, rarity has not received much attention in the context of data mining. Now, as increasingly complex real-world problems are addressed, rarity, and the related problem of imbalanced data, are taking center stage. This article discusses the role that rare classes and rare cases play in data mining. The problems that can result from these two forms of rarity are described in detail, as are methods for addressing these problems. These descriptions utilize examples from existing research, so that this article provides a good survey of the literature on rarity in data mining. This article also demonstrates that rare classes and rare cases are very similar phenomena---both forms of rarity are shown to cause similar problems during data mining and benefit from the same remediation methods.
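One of the problems the abstract alludes to can be made concrete with a small sketch: when one class is rare, overall accuracy is a misleading evaluation metric. The toy class distribution below (99:1) and the degenerate classifier are illustrative assumptions, not taken from the article.

```python
# Illustrative sketch (assumed 99:1 toy data): why rarity breaks
# accuracy-based evaluation in data mining.

def evaluate(labels, predictions):
    """Return overall accuracy and recall on the rare (positive) class."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    accuracy = correct / len(labels)
    positives = [i for i, y in enumerate(labels) if y == 1]
    hits = sum(1 for i in positives if predictions[i] == 1)
    recall = hits / len(positives) if positives else 0.0
    return accuracy, recall

# 990 common-class examples (label 0) and 10 rare-class examples (label 1).
labels = [0] * 990 + [1] * 10

# A degenerate "classifier" that always predicts the common class
# still scores 99% accuracy while missing every rare case.
predictions = [0] * 1000

acc, rec = evaluate(labels, predictions)
print(acc, rec)  # → 0.99 0.0
```

This is why the literature surveyed here turns to metrics such as recall, precision, F-measure, and the area under the ROC curve when rare classes matter.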

