
Training and assessing classification rules with imbalanced data

Data Mining and Knowledge Discovery

Abstract

The problem of modeling binary responses with cross-sectional data has been addressed by a number of satisfactory solutions that draw on both parametric and nonparametric methods. However, there are many real situations in which one of the two responses (usually the one of greatest interest for the analysis) is rare. It has been widely reported that this class imbalance heavily compromises the learning process, because the model tends to focus on the prevalent class and to ignore the rare events. A skewed class distribution affects not only the estimation of the classification model but also the evaluation of its accuracy, because the scarcity of data leads to poor estimates of the model’s accuracy. In this work, the effects of class imbalance on model training and model assessment are discussed. Moreover, a unified and systematic framework for dealing with the problem of imbalanced classification is proposed, based on a smoothed bootstrap re-sampling technique. The proposed technique rests on a sound theoretical basis, and an extensive empirical study shows that it outperforms the main alternative remedies for imbalanced learning problems.
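To make the idea of smoothed bootstrap re-sampling concrete, the following is a minimal sketch of a generic version and not the authors' implementation. It picks a class label for each synthetic point, resamples an observed point of that class, and perturbs it with kernel noise. The Gaussian kernel, the normal-reference bandwidth rule, the 0/1 label coding, and the names `smoothed_bootstrap`, `p_class1` and `shrink` are all assumptions introduced here for illustration.

```python
import numpy as np

def smoothed_bootstrap(X, y, n_samples, p_class1=0.5, shrink=1.0, seed=None):
    """Draw a synthetic sample with a chosen class balance.

    Assumptions (illustrative, not taken from the article):
    binary labels coded 0/1, a Gaussian kernel, and a diagonal
    bandwidth from the normal-reference rule of thumb.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    d = X.shape[1]

    # Step 1: choose the class of each synthetic point.
    labels = (rng.random(n_samples) < p_class1).astype(int)
    X_new = np.empty((n_samples, d))

    for cls in (0, 1):
        idx = np.flatnonzero(y == cls)
        if idx.size < 2:
            raise ValueError(f"class {cls} needs at least two observations")
        mask = labels == cls
        m = int(mask.sum())
        if m == 0:
            continue
        # Step 2: per-feature bandwidth (normal-reference rule), scaled by `shrink`.
        scale = (4.0 / ((d + 2) * idx.size)) ** (1.0 / (d + 4))
        h = shrink * scale * X[idx].std(axis=0, ddof=1)
        # Step 3: resample observed points of this class with replacement ...
        centers = X[rng.choice(idx, size=m, replace=True)]
        # ... and add Gaussian kernel noise around each of them.
        X_new[mask] = centers + rng.normal(size=(m, d)) * h

    return X_new, labels
```

Setting `shrink=0` removes the noise, so the procedure reduces to an ordinary class-balanced bootstrap. Larger values spread the synthetic points further from the observed ones.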

Author information

Corresponding author

Correspondence to Giovanna Menardi.

Additional information

Responsible editor: Chih-Jen Lin.

About this article

Cite this article

Menardi, G., Torelli, N. Training and assessing classification rules with imbalanced data. Data Min Knowl Disc 28, 92–122 (2014). https://doi.org/10.1007/s10618-012-0295-5

  • DOI: https://doi.org/10.1007/s10618-012-0295-5
