Abstract
Many recent efforts have proposed new intelligent resampling methods to address class imbalance, one of the main current challenges in the machine learning community. These methods are usually intended to balance the classes. However, works in the literature show that they can also be used to change the class distribution of mildly imbalanced, or even balanced, databases to a distribution other than 50%, significantly improving the outcome of the learning process. The aim of this paper is to analyse which resampling methods are the most competitive in this context. Experiments were performed on 29 databases with 8 different resampling methods and two learning algorithms, and were evaluated using the AUC performance metric and statistical tests. The results show that SMOTE, the well-known intelligent resampling method, is one of the best candidates, improving on the results obtained by some of its variants that are successful in the context of class imbalance.
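As a rough illustration of the kind of intelligent resampling the paper studies, the following is a minimal SMOTE-style oversampling sketch: for each synthetic point, a minority sample is paired with one of its k nearest minority neighbours and a new point is interpolated between them. The function name and parameters are ours, chosen for illustration; they do not come from the paper.

```python
import numpy as np

def smote_sample(minority, n_new, k=5, rng=None):
    """Generate n_new synthetic samples from a minority class (SMOTE-style)."""
    rng = np.random.default_rng(rng)
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        # pick a random minority sample
        i = rng.integers(len(minority))
        x = minority[i]
        # find its k nearest minority neighbours (excluding itself)
        d = np.linalg.norm(minority - x, axis=1)
        neighbours = np.argsort(d)[1:k + 1]
        z = minority[rng.choice(neighbours)]
        # interpolate at a random point on the segment between x and z
        synthetic.append(x + rng.random() * (z - x))
    return np.array(synthetic)
```

By varying `n_new`, the same mechanism can target any class distribution, not only a 50% balance, which is the scenario the paper investigates.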
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
Cite this paper
Arbelaitz, O., Gurrutxaga, I., Muguerza, J., Pérez, J.M. (2013). Applying Resampling Methods for Imbalanced Datasets to Not So Imbalanced Datasets. In: Bielza, C., et al. (eds.) Advances in Artificial Intelligence. CAEPIA 2013. Lecture Notes in Computer Science, vol. 8109. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40643-0_12
DOI: https://doi.org/10.1007/978-3-642-40643-0_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40642-3
Online ISBN: 978-3-642-40643-0