Mining with rarity: a unifying framework

Abstract
Rare objects are often of great interest and great value. Until recently, however, rarity has not received much attention in the context of data mining. Now, as increasingly complex real-world problems are addressed, rarity, and the related problem of imbalanced data, are taking center stage. This article discusses the role that rare classes and rare cases play in data mining. The problems that can result from these two forms of rarity are described in detail, as are methods for addressing these problems. These descriptions use examples from existing research, so that this article provides a good survey of the literature on rarity in data mining. The article also demonstrates that rare classes and rare cases are very similar phenomena: both forms of rarity cause similar problems during data mining and benefit from the same remediation methods.
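Among the remediation methods the article surveys are sampling techniques that rebalance the class distribution before learning. As a minimal illustrative sketch (not taken from the article itself; the function name and data are hypothetical), random oversampling simply duplicates minority-class examples until both classes are equally represented:

```python
import random

def random_oversample(examples, labels, target_label, seed=0):
    """Duplicate examples of the rare (minority) class until the
    class distribution is balanced.  A sketch of the simplest
    sampling-based remedy for class imbalance."""
    rng = random.Random(seed)
    minority = [(x, y) for x, y in zip(examples, labels) if y == target_label]
    majority = [(x, y) for x, y in zip(examples, labels) if y != target_label]
    # Number of duplicated minority examples needed to reach balance.
    deficit = max(len(majority) - len(minority), 0)
    extra = [rng.choice(minority) for _ in range(deficit)]
    resampled = majority + minority + extra
    rng.shuffle(resampled)
    xs, ys = zip(*resampled)
    return list(xs), list(ys)

# Tiny imbalanced data set: 8 majority (0) and 2 minority (1) examples.
X = list(range(10))
y = [0] * 8 + [1] * 2
X_bal, y_bal = random_oversample(X, y, target_label=1)
print(y_bal.count(0), y_bal.count(1))  # both classes now have 8 examples
```

More sophisticated variants, such as synthesizing new minority examples rather than duplicating existing ones (as in SMOTE), address the overfitting risk that exact duplication introduces.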