
COG: local decomposition for rare class analysis

Published in: Data Mining and Knowledge Discovery

Abstract

The problem of predicting rare classes in large-scale, multi-labeled data sets has attracted considerable attention in the literature. Nevertheless, rare class analysis remains a critical challenge, because most standard learning algorithms have no natural mechanism for handling imbalanced class distributions. This paper fills this void by developing a method for classification using local clustering (COG). Specifically, for a data set with an imbalanced class distribution, we perform clustering within each large class to produce sub-classes of relatively balanced sizes, and then apply a traditional supervised learning algorithm, such as a support vector machine (SVM), to the resulting sub-class problem. Along this line, we explore key properties of local clustering for a better understanding of the effect of COG on rare class analysis, and we provide a systematic analysis of the time and space complexity of the COG method. Experimental results on various real-world data sets show that COG produces significantly higher prediction accuracy on rare classes than state-of-the-art methods, and that the COG scheme can greatly improve the computational performance of SVMs. Furthermore, we show that COG can also improve the performance of traditional supervised learning algorithms on data sets with balanced class distributions. Finally, as two case studies, we apply COG to two real-world applications: credit card fraud detection and network intrusion detection.
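The decomposition described above can be illustrated with a short sketch. This is a minimal, hypothetical implementation of the idea (not the authors' code): each large class is split by k-means into sub-classes of roughly the rare-class size, a standard linear SVM is trained on the relabeled sub-class problem, and predictions are mapped back to the original classes. The function names `cog_fit` and `cog_predict` are assumptions for illustration; scikit-learn's `KMeans` and `LinearSVC` stand in for the clustering and SVM components.

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def cog_fit(X, y, random_state=0):
    """COG-style local decomposition followed by standard SVM training.

    Each large class is clustered into sub-classes whose sizes are
    comparable to that of the rarest class, so the relabeled problem
    has a (roughly) balanced class distribution.
    """
    y = np.asarray(y)
    counts = Counter(y)
    min_size = min(counts.values())        # size of the rarest class
    sub_labels = np.empty(len(y), dtype=int)
    back = {}                              # sub-class label -> original class
    next_label = 0
    for cls, n in counts.items():
        idx = np.flatnonzero(y == cls)
        k = max(1, n // min_size)          # number of sub-classes for this class
        if k == 1:
            local = np.zeros(len(idx), dtype=int)
        else:
            local = KMeans(n_clusters=k, n_init=10,
                           random_state=random_state).fit_predict(X[idx])
        sub_labels[idx] = local + next_label
        for s in range(k):
            back[next_label + s] = cls
        next_label += k
    clf = LinearSVC().fit(X, sub_labels)   # train on the balanced sub-class problem
    return clf, back

def cog_predict(clf, back, X):
    # predict sub-classes, then map each back to its original class
    return np.array([back[int(s)] for s in clf.predict(X)])

# toy imbalanced problem: 200 majority points vs 20 minority points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(4, 1, (20, 2))])
y = np.array([0] * 200 + [1] * 20)
clf, back = cog_fit(X, y)
pred = cog_predict(clf, back, X)
```

Here the majority class is split into about ten sub-classes of roughly twenty points each, so the SVM never sees a 10:1 class ratio; this mirrors the balancing effect the abstract attributes to local clustering.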



Author information

Corresponding author

Correspondence to Hui Xiong.

Additional information

Responsible editor: Sanjay Chawla.


About this article

Cite this article

Wu, J., Xiong, H. & Chen, J. COG: local decomposition for rare class analysis. Data Min Knowl Disc 20, 191–220 (2010). https://doi.org/10.1007/s10618-009-0146-1
