Abstract
Many recent efforts have proposed new intelligent resampling methods to address class imbalance, one of the main current challenges in the machine learning community. These methods are usually intended to balance the classes. However, works in the literature show that they can also be used to change the class distribution of mildly imbalanced, or even balanced, databases to a distribution other than 50%, significantly improving the outcome of the learning process. The aim of this paper is to analyse which resampling methods are the most competitive in this context. Experiments were performed on 29 databases with 8 different resampling methods and two learning algorithms, and were evaluated using the AUC performance metric and statistical tests. The results show that SMOTE, the well-known intelligent resampling method, is one of the best candidates, improving on the results obtained by some of its variants that are successful in the context of class imbalance.
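As a rough illustration of the kind of intelligent resampling the paper studies, the following is a minimal SMOTE-style oversampling sketch: for each synthetic point, a minority sample is paired with one of its k nearest minority neighbours and a new point is interpolated between them. The function name and parameters are ours, chosen for illustration; they do not come from the paper.

```python
import numpy as np

def smote_sample(minority, n_new, k=5, rng=None):
    """Generate n_new synthetic samples from a minority class (SMOTE-style)."""
    rng = np.random.default_rng(rng)
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        # pick a random minority sample
        i = rng.integers(len(minority))
        x = minority[i]
        # find its k nearest minority neighbours (excluding itself)
        d = np.linalg.norm(minority - x, axis=1)
        neighbours = np.argsort(d)[1:k + 1]
        z = minority[rng.choice(neighbours)]
        # interpolate at a random point on the segment between x and z
        synthetic.append(x + rng.random() * (z - x))
    return np.array(synthetic)
```

By varying `n_new`, the same mechanism can target any class distribution, not only a 50% balance, which is the scenario the paper investigates.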
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
Cite this paper
Arbelaitz, O., Gurrutxaga, I., Muguerza, J., Pérez, J.M. (2013). Applying Resampling Methods for Imbalanced Datasets to Not So Imbalanced Datasets. In: Bielza, C., et al. (eds.) Advances in Artificial Intelligence. CAEPIA 2013. Lecture Notes in Computer Science, vol. 8109. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40643-0_12
DOI: https://doi.org/10.1007/978-3-642-40643-0_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40642-3
Online ISBN: 978-3-642-40643-0