Automatically countering imbalance and its empirical relationship to cost

Chawla, Nitesh V.; Cieslak, David A.; Hall, Lawrence O.; Joshi, Ajay

doi:10.1007/s10618-008-0087-0

Published: 17 February 2008

Volume 17, pages 225–252, (2008)

Data Mining and Knowledge Discovery

Nitesh V. Chawla¹, David A. Cieslak¹, Lawrence O. Hall² & Ajay Joshi²

Abstract

Learning from imbalanced data sets presents a convoluted problem from both the modeling and cost standpoints. In particular, when a class is of great interest but occurs relatively rarely, such as in cases of fraud, instances of disease, and regions of interest in large-scale simulations, there is a correspondingly high cost for the misclassification of rare events. Under such circumstances, the data set is often re-sampled to generate models with high minority class accuracy. However, the sampling methods face a common, but important, criticism: how to automatically discover the proper amount and type of sampling? To address this problem, we propose a wrapper paradigm that discovers the amount of re-sampling for a data set based on optimizing evaluation functions like the f-measure, Area Under the ROC Curve (AUROC), cost, cost-curves, and the cost-dependent f-measure. Our analysis of the wrapper is twofold. First, we report the interaction between different evaluation and wrapper optimization functions. Second, we present a set of results in a cost-sensitive environment, including scenarios of unknown or changing cost matrices. We also compared the performance of the wrapper approach versus cost-sensitive learning methods (MetaCost and the Cost-Sensitive Classifiers) and found the wrapper to outperform the cost-sensitive classifiers in a cost-sensitive environment. Lastly, we obtained the lowest cost per test example compared to any result we are aware of for the KDD-99 Cup intrusion detection data set.
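The wrapper idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes random under-sampling of the majority class only, a toy one-dimensional threshold learner, the f-measure as the single optimization function, and synthetic data. The names `make_toy_data`, `undersample`, `fit_threshold`, and `wrapper_search` are hypothetical and introduced here for illustration.

```python
import random


def f_measure(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall on the positive (minority) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def undersample(X, y, keep_fraction, rng, majority=0):
    """Keep every minority example and a random fraction of the majority class."""
    maj = [i for i, label in enumerate(y) if label == majority]
    mino = [i for i, label in enumerate(y) if label != majority]
    n_keep = min(len(maj), max(1, round(keep_fraction * len(maj))))
    kept = rng.sample(maj, n_keep) + mino
    return [X[i] for i in kept], [y[i] for i in kept]


def fit_threshold(X, y):
    """Toy learner (assumption): the cut-point with the fewest training errors."""
    best_cut, best_err = None, None
    for cut in sorted(set(X)):
        err = sum(1 for x, label in zip(X, y) if (1 if x >= cut else 0) != label)
        if best_err is None or err < best_err:
            best_cut, best_err = cut, err
    return best_cut


def predict(cut, X):
    return [1 if x >= cut else 0 for x in X]


def wrapper_search(X, y, fractions, folds=5, seed=0):
    """Score each candidate sampling level by cross-validated f-measure."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    scores = {}
    for frac in fractions:
        fold_scores = []
        for k in range(folds):
            val = idx[k::folds]
            val_set = set(val)
            train = [i for i in idx if i not in val_set]
            # Re-sample the training split only; validation keeps its skew.
            Xs, ys = undersample([X[i] for i in train], [y[i] for i in train], frac, rng)
            cut = fit_threshold(Xs, ys)
            preds = predict(cut, [X[i] for i in val])
            fold_scores.append(f_measure([y[i] for i in val], preds))
        scores[frac] = sum(fold_scores) / folds
    best = max(scores, key=scores.get)
    return best, scores


def make_toy_data(n_major=380, n_minor=20, seed=42):
    """Synthetic 1-D data (assumption): two overlapping Gaussians, 5% minority."""
    rng = random.Random(seed)
    X = [rng.gauss(0.0, 1.0) for _ in range(n_major)]
    X += [rng.gauss(1.5, 1.0) for _ in range(n_minor)]
    y = [0] * n_major + [1] * n_minor
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    best, scores = wrapper_search(X, y, fractions=[0.1, 0.25, 0.5, 1.0])
    print("selected majority keep-fraction:", best)
    for frac, score in scores.items():
        print(f"  keep {frac:.2f} -> mean f-measure {score:.3f}")
```

Swapping `f_measure` for a cost-based score (and taking the minimum rather than the maximum) would give a cost-optimizing variant of the same loop; the validation folds are deliberately left un-sampled so that the score reflects the original class distribution.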

References

Amor NB, Benferhat S, Elouedi Z (2004) Naive bayes vs. decision trees in intrusion detection systems. In: Proceedings of the ACM symposium on applied computing, pp 420–424
Banfield RE, Hall LO, Bowyer KW, Kegelmeyer WP (2005) Ensembles of classifiers from spatially disjoint data. In: Proceedings of the sixth international conference on multiple classifier systems, pp 196–205
Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explorations 6(1): 20–29
Blake CL, Newman DJ, Hettich S, Merz CJ (1998) UCI repository of machine learning databases. URL: http://www.ics.uci.edu/~mlearn/MLRepository.html
Bowyer KW, Hall LO, Chawla NV, Moore TE (2000) A parallel decision tree builder for mining very large visualization datasets. In: Proceedings of the IEEE International conference on systems, man and cybernetics
Breiman L (1996) Bagging predictors. Machine Learn 24(2): 123–140
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intel Res 16: 321–357
Chawla NV, Hall LO, Joshi A (2005) Wrapper-based computation and evaluation of sampling methods for imbalanced datasets. In: KDD workshop: utility-based data mining
Chawla NV, Japkowicz N, Kołcz A (eds) (2003) Proceedings of the ICML’2003 workshop on learning from imbalanced data sets
Chawla NV, Japkowicz N, Kołcz A (2004) Editorial: learning from imbalanced datasets. SIGKDD Explorations 6(1): 1–6
Cieslak D, Chawla NV (2006) Calibration and power of PETs on unbalanced datasets. TR 2006-12, Department of Computer Science and Engineering, University of Notre Dame
Cohen WW (1995a) Fast effective rule induction. In: Prieditis A, Russell S (eds) 12th International conference on machine learning, Morgan Kaufmann, Tahoe City, CA, pp 115–123
Cohen WW (1995b) Learning to classify English text with ILP methods. In: 5th International workshop on inductive logic programming, pp 3–24
Dietterich T, Margineantu D, Provost F, Turney P (eds) (2000) Proceedings of the ICML’2000 workshop on cost-sensitive learning
Domingos P (1999) MetaCost: a general method for making classifiers cost-sensitive. In: Knowledge discovery and data mining, pp 155–164
Drummond C, Holte R (2003) C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling. In: Proceedings of the ICML’03 workshop on learning from imbalanced data sets
Drummond C, Holte RC (2006) Cost curves: an improved method for visualizing classifier performance. Machine Learn 65(1): 95–130
Dumais S, Platt J, Heckerman D, Sahami M (1998) Inductive learning algorithms and representations for text categorization. In: Seventh international conference on information and knowledge management, pp 148–155
Elkan C (1999) Results of the KDD’99 classifier learning contest. http://www.cse.ucsd.edu/~elkan/clresults.html
Elkan C (2001) The foundations of cost-sensitive learning. In: Proceedings of the seventeenth international joint conference on artificial intelligence, pp 973–978
Esposito F, Malerba D, Semeraro G (1994) Multistrategy learning for document recognition. Appl Artif Intel 8: 33–84
Ferri C, Flach P, Orallo J, Lachiche N (eds) (2004) First workshop on ROC analysis in AI. ECAI
Kubat M, Holte RC, Matwin S (1998) Machine learning for the detection of oil spills in satellite radar images. Machine Learn 30(2–3): 195–215
Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: one sided selection. In: Proceedings of the fourteenth international conference on machine learning. Morgan Kaufmann, Nashville, Tennessee, pp 179–186
Lewis D, Ringuette M (1994) A comparison of two learning algorithms for text categorization. In: 3rd Annual symposium on document analysis and information retrieval, pp 81–93
Ling C, Li C (1998) Data mining for direct marketing problems and solutions. In: Proceedings of the fourth international conference on knowledge discovery and data mining (KDD-98). AAAI Press, New York, NY, pp 73–79
Maloof M (2003) Learning when data sets are imbalanced and when costs are unequal and unknown. In: Proceedings of the ICML’03 workshop on learning from imbalanced data sets
Mladenic D, Grobelnik M (1999) Feature selection for unbalanced class distribution and naive bayes. In: ICML, pp 258–267
Provost FJ, Domingos P (2003) Tree induction for probability-based ranking. Machine Learn 52(3): 199–215
Provost FJ, Fawcett T, Kohavi R (1998) The case against accuracy estimation for comparing induction algorithms. In: Fifteenth international conference on machine learning, pp 445–453
Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann
Sabhnani MR, Serpen G (2003) Application of machine learning algorithms to KDD intrusion detection dataset with misuse detection context. In: Proceedings of the international conference on machine learning: models, technologies, and applications, pp 209–215
Thain D, Tannenbaum T, Livny M (2005) Distributed computing in practice: the condor experience. Concur Comput Pract Exp 17: 323–356
Weiss G, McCarthy K, Zabar B (2007) Cost-sensitive learning vs. sampling: which is best for handling unbalanced classes with unequal error costs? In: DMIN, pp 35–41
Weiss G, Provost F (2003) Learning when training data are costly: the effect of class distribution on tree induction. J Artif Intel Res 19: 315–354
Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques. Morgan Kaufmann
Woods K, Doss C, Bowyer KW, Solka J, Priebe C, Kegelmeyer WP (1993) Comparative evaluation of pattern recognition techniques for detection of microcalcifications in mammography. Int J Pattern Recog Artif Intel 7(6): 1417–1436
Zadrozny B, Elkan C (2001) Learning and making decisions when costs and probabilities are both unknown. In: Proceedings of the sixth ACM SIGKDD international conference on knowledge discovery and data mining, pp 204–213
Zadrozny B, Langford J, Abe N (2003) Cost-sensitive learning by cost-proportionate example weighting. In: ICDM, pp 435–442
Zhou Z, Liu X (2006) Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Trans Knowledge Data Eng 18(1): 63–77

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, 46556, USA
Nitesh V. Chawla & David A. Cieslak
Department of Computer Science and Engineering, University of South Florida, Tampa, FL, 33620-5399, USA
Lawrence O. Hall & Ajay Joshi

Corresponding author

Correspondence to Nitesh V. Chawla.

Additional information

Responsible editor: Gary M. Weiss.

About this article

Cite this article

Chawla, N.V., Cieslak, D.A., Hall, L.O. et al. Automatically countering imbalance and its empirical relationship to cost. Data Min Knowl Disc 17, 225–252 (2008). https://doi.org/10.1007/s10618-008-0087-0

Received: 11 November 2006
Accepted: 08 January 2008
Published: 17 February 2008
Issue Date: October 2008
DOI: https://doi.org/10.1007/s10618-008-0087-0
