Training and assessing classification rules with imbalanced data

Menardi, Giovanna; Torelli, Nicola

doi:10.1007/s10618-012-0295-5

Published: 30 October 2012

Data Mining and Knowledge Discovery, volume 28, pages 92–122 (2014)

Abstract

The problem of modeling binary responses from cross-sectional data has been addressed with a number of satisfactory solutions that draw on both parametric and nonparametric methods. However, there are many real situations in which one of the two responses (usually the one of greatest interest for the analysis) is rare. It has been widely reported that this class imbalance heavily compromises the learning process, because the model tends to focus on the prevalent class and to ignore the rare events. A skewed class distribution affects not only the estimation of the classification model but also the evaluation of its accuracy, because the scarcity of data leads to poor estimates of the model’s accuracy. In this work, the effects of class imbalance on model training and model assessment are discussed. Moreover, a unified and systematic framework for dealing with the problem of imbalanced classification is proposed, based on a smoothed bootstrap re-sampling technique. The proposed technique is founded on a sound theoretical basis, and an extensive empirical study shows that it outperforms the main alternative remedies for imbalanced learning problems.
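
The abstract names the technique but gives no details, so what follows is only a minimal sketch of how a smoothed bootstrap could rebalance a binary sample. It assumes a Gaussian kernel, a diagonal normal-reference bandwidth and an equal target share for the two classes; none of these choices is stated in the abstract. The function name `smoothed_bootstrap` and its parameters are hypothetical and are not taken from the article.

```python
import numpy as np


def smoothed_bootstrap(X, y, n_samples, p_minority=0.5, shrink=1.0, seed=0):
    """Draw a synthetic, roughly balanced sample from binary-labelled data.

    Assumptions (illustrative, not from the abstract): Gaussian kernel,
    per-feature bandwidth from a normal-reference rule of thumb, exactly
    two classes, and at least two observations in each class.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n_features = X.shape[1]

    classes, counts = np.unique(y, return_counts=True)
    if classes.size != 2:
        raise ValueError("exactly two classes are required")
    rare_pos = int(np.argmin(counts))
    minority, majority = classes[rare_pos], classes[1 - rare_pos]

    # Step 1: pick the class of each new point with the target probability.
    new_y = np.where(rng.random(n_samples) < p_minority, minority, majority)
    new_X = np.empty((n_samples, n_features))

    for label in (minority, majority):
        Xc = X[y == label]
        idx = np.flatnonzero(new_y == label)
        if idx.size == 0:
            continue
        n_c = Xc.shape[0]
        # Per-feature bandwidth: normal-reference constant times the
        # within-class standard deviation (an assumed rule of thumb).
        const = (4.0 / ((n_features + 2) * n_c)) ** (1.0 / (n_features + 4))
        h = shrink * const * Xc.std(axis=0, ddof=1)
        # Step 2: resample observed points of that class with replacement.
        centers = Xc[rng.integers(0, n_c, size=idx.size)]
        # Step 3: perturb each resampled point with kernel noise.
        new_X[idx] = centers + rng.normal(size=(idx.size, n_features)) * h

    return new_X, new_y


# Usage on toy data with a 5% rare class (made-up numbers).
toy_rng = np.random.default_rng(1)
X_toy = np.vstack([toy_rng.normal(0.0, 1.0, size=(95, 2)),
                   toy_rng.normal(2.0, 1.0, size=(5, 2))])
y_toy = np.array([0] * 95 + [1] * 5)
X_new, y_new = smoothed_bootstrap(X_toy, y_toy, n_samples=1000)
```

Setting `shrink=0` in this sketch removes the kernel noise and reduces it to an ordinary class-balanced bootstrap, which only duplicates the few observed rare points.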

Author information

Authors and Affiliations

Dipartimento di Scienze Statistiche, Università degli Studi di Padova, via C. Battisti, 241, Padova, Italy
Giovanna Menardi
Dipartimento di Scienze Economiche, Aziendali, Matematiche e Statistiche “Bruno de Finetti”, Università degli Studi di Trieste, Trieste, Italy
Nicola Torelli

Corresponding author

Correspondence to Giovanna Menardi.

Additional information

Responsible editor: Chih-Jen Lin.

About this article

Cite this article

Menardi, G., Torelli, N. Training and assessing classification rules with imbalanced data. Data Min Knowl Disc 28, 92–122 (2014). https://doi.org/10.1007/s10618-012-0295-5

Received: 13 September 2010
Accepted: 12 October 2012
Published: 30 October 2012
Issue Date: January 2014
DOI: https://doi.org/10.1007/s10618-012-0295-5
