
COG: local decomposition for rare class analysis

Published in: Data Mining and Knowledge Discovery

Abstract

The problem of predicting rare classes in large-scale, multi-labeled data sets has attracted considerable attention in the literature. Nevertheless, rare class analysis remains a critical challenge, because most standard learning algorithms have no natural mechanism for handling imbalanced class distributions. This paper fills this void by developing a method for classification using local clustering (COG). Specifically, for a data set with an imbalanced class distribution, we perform clustering within each large class to produce sub-classes of relatively balanced sizes, and then apply a traditional supervised learning algorithm, such as a support vector machine (SVM), to the resulting sub-class problem. Along this line, we explore key properties of local clustering for a better understanding of the effect of COG on rare class analysis, and we provide a systematic analysis of the time and space complexity of the COG method. Experimental results on various real-world data sets show that COG produces significantly higher prediction accuracy on rare classes than state-of-the-art methods, and that the COG scheme can greatly improve the computational performance of SVMs. Furthermore, we show that COG can also improve the performance of traditional supervised learning algorithms on data sets with balanced class distributions. Finally, as two case studies, we apply COG to two real-world applications: credit card fraud detection and network intrusion detection.
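The decomposition described above can be illustrated with a short sketch. This is a minimal, hypothetical implementation of the idea (not the authors' code): each large class is split by k-means into sub-classes of roughly the rare-class size, a standard linear SVM is trained on the relabeled sub-class problem, and predictions are mapped back to the original classes. The function names `cog_fit` and `cog_predict` are assumptions for illustration; scikit-learn's `KMeans` and `LinearSVC` stand in for the clustering and SVM components.

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def cog_fit(X, y, random_state=0):
    """COG-style local decomposition followed by standard SVM training.

    Each large class is clustered into sub-classes whose sizes are
    comparable to that of the rarest class, so the relabeled problem
    has a (roughly) balanced class distribution.
    """
    y = np.asarray(y)
    counts = Counter(y)
    min_size = min(counts.values())        # size of the rarest class
    sub_labels = np.empty(len(y), dtype=int)
    back = {}                              # sub-class label -> original class
    next_label = 0
    for cls, n in counts.items():
        idx = np.flatnonzero(y == cls)
        k = max(1, n // min_size)          # number of sub-classes for this class
        if k == 1:
            local = np.zeros(len(idx), dtype=int)
        else:
            local = KMeans(n_clusters=k, n_init=10,
                           random_state=random_state).fit_predict(X[idx])
        sub_labels[idx] = local + next_label
        for s in range(k):
            back[next_label + s] = cls
        next_label += k
    clf = LinearSVC().fit(X, sub_labels)   # train on the balanced sub-class problem
    return clf, back

def cog_predict(clf, back, X):
    # predict sub-classes, then map each back to its original class
    return np.array([back[int(s)] for s in clf.predict(X)])

# toy imbalanced problem: 200 majority points vs 20 minority points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(4, 1, (20, 2))])
y = np.array([0] * 200 + [1] * 20)
clf, back = cog_fit(X, y)
pred = cog_predict(clf, back, X)
```

Here the majority class is split into about ten sub-classes of roughly twenty points each, so the SVM never sees a 10:1 class ratio; this mirrors the balancing effect the abstract attributes to local clustering.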



Author information

Corresponding author

Correspondence to Hui Xiong.

Additional information

Responsible editor: Sanjay Chawla.


About this article

Cite this article

Wu, J., Xiong, H. & Chen, J. COG: local decomposition for rare class analysis. Data Min Knowl Disc 20, 191–220 (2010). https://doi.org/10.1007/s10618-009-0146-1
