Abstract
Given its importance, the problem of predicting rare classes in large-scale multi-labeled data sets has attracted great attention in the literature. However, rare class analysis remains a critical challenge, because no natural way of handling imbalanced class distributions has been developed. This paper fills this void by developing a method for classification using local clustering (COG). Specifically, for a data set with an imbalanced class distribution, we perform clustering within each large class and produce sub-classes of relatively balanced sizes. We then apply traditional supervised learning algorithms, such as support vector machines (SVMs), for classification. Along this line, we explore key properties of local clustering to better understand the effect of COG on rare class analysis. We also provide a systematic analysis of the time and space complexity of the COG method. Experimental results on various real-world data sets show that COG produces significantly higher prediction accuracy on rare classes than state-of-the-art methods, and that the COG scheme greatly improves the computational performance of SVMs. Furthermore, we show that COG can also improve the performance of traditional supervised learning algorithms on data sets with balanced class distributions. Finally, we present two case studies in which COG is applied to real-world problems: credit card fraud detection and network intrusion detection.
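The decomposition step described above is straightforward to prototype. The following is a minimal sketch of the COG idea, not the authors' implementation: it assumes scikit-learn and NumPy are available, and the helper names (cog_fit, cog_predict) and the k_per_class parameter are hypothetical. Each large class is split into sub-classes by local k-means (which tends to produce relatively balanced sub-clusters), a standard SVM is trained on the sub-class labels, and predictions are mapped back to the original classes.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def cog_fit(X, y, k_per_class, random_state=0):
    """Local decomposition: split each large class into k sub-classes,
    relabel the training data, and fit a linear SVM on the sub-class labels."""
    sub_labels = np.empty(len(y), dtype=int)
    sub_to_class = {}          # maps each sub-class label back to its original class
    next_label = 0
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        k = k_per_class.get(c, 1)   # k = 1 leaves a class (e.g. a rare class) intact
        if k > 1:
            km = KMeans(n_clusters=k, n_init=10, random_state=random_state)
            local = km.fit_predict(X[idx])
        else:
            local = np.zeros(len(idx), dtype=int)
        for j in range(local.max() + 1):
            sub_to_class[next_label + j] = c
        sub_labels[idx] = next_label + local
        next_label += local.max() + 1
    clf = LinearSVC().fit(X, sub_labels)
    return clf, sub_to_class

def cog_predict(clf, sub_to_class, X):
    """Predict sub-class labels and map them back to the original classes."""
    return np.array([sub_to_class[s] for s in clf.predict(X)])

As a usage illustration, on a two-class problem where class 0 heavily outnumbers class 1, a call such as cog_fit(X, y, k_per_class={0: 5}) would split the large class into five sub-classes whose sizes are closer to that of the rare class, which is the balancing effect exploited by COG.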
Additional information
Responsible editor: Sanjay Chawla.
Cite this article
Wu, J., Xiong, H. & Chen, J. COG: local decomposition for rare class analysis. Data Min Knowl Disc 20, 191–220 (2010). https://doi.org/10.1007/s10618-009-0146-1