Abstract
In classification tasks with imbalanced datasets, the examples are unevenly distributed among the classes. However, it is not the imbalance itself that hinders performance; other related intrinsic data characteristics have a significant influence on the final accuracy. Among these, the overlap between the classes is possibly the most significant obstacle to correct discrimination between them.
In this contribution we develop a novel proposal to deal with this overlapping problem: a multi-objective evolutionary algorithm that optimizes both the number of variables and the number of instances of the problem. Feature selection simplifies the overlapping areas, easing the generation of rules to distinguish between the classes, whereas instance selection of samples from both classes addresses the imbalance itself by finding the most appropriate class distribution for the learning task and by removing noisy and difficult borderline examples.
Our experimental results, obtained using the C4.5 decision tree as the baseline classifier, show that this approach is very promising. Our proposal outperforms, with statistically significant differences, the results obtained with the SMOTE + ENN oversampling technique, which has been shown to be a baseline methodology for classification with imbalanced datasets.
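The joint feature and instance selection described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: a candidate solution is encoded as two boolean masks (one over features, one over instances), the two objectives are taken to be a geometric mean of per-class recall (to maximize) and the number of retained features plus instances (to minimize), a 1-nearest-neighbour rule stands in for C4.5, and random candidates filtered by Pareto dominance stand in for the evolutionary search.

```python
import math
import random

def apply_masks(X, y, feature_mask, instance_mask):
    """Keep only the selected rows (instances) and columns (features)."""
    Xs = [[v for v, keep in zip(row, feature_mask) if keep]
          for row, keep in zip(X, instance_mask) if keep]
    ys = [label for label, keep in zip(y, instance_mask) if keep]
    return Xs, ys

def predict_1nn(Xs, ys, query):
    """Label of the nearest retained instance (squared Euclidean distance)."""
    best = min(range(len(Xs)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(Xs[i], query)))
    return ys[best]

def g_mean(X, y, feature_mask, instance_mask):
    """Geometric mean of per-class recall over the full data set.

    Retained instances are also used as queries, so this estimate is
    optimistic; a real study would evaluate on held-out data.
    """
    Xs, ys = apply_masks(X, y, feature_mask, instance_mask)
    if not Xs or not any(feature_mask):
        return 0.0
    classes = set(y)
    hits = {c: 0 for c in classes}
    totals = {c: 0 for c in classes}
    for row, label in zip(X, y):
        query = [v for v, keep in zip(row, feature_mask) if keep]
        totals[label] += 1
        hits[label] += predict_1nn(Xs, ys, query) == label
    return math.prod(hits[c] / totals[c] for c in classes) ** (1 / len(classes))

def objectives(X, y, feature_mask, instance_mask):
    """Return (quality to maximize, retained size to minimize)."""
    size = sum(feature_mask) + sum(instance_mask)
    return g_mean(X, y, feature_mask, instance_mask), size

def dominates(a, b):
    """True if a is no worse than b on both objectives and better on one."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    better = a[0] > b[0] or a[1] < b[1]
    return no_worse and better

def pareto_front(scored):
    """Filter (objectives, solution) pairs down to the non-dominated ones."""
    return [s for s in scored
            if not any(dominates(o[0], s[0]) for o in scored)]

# Hypothetical toy data: 8 majority and 3 minority examples, 3 features,
# where the third feature is pure noise.
random.seed(0)
X = ([[random.gauss(0, 1), random.gauss(0, 1), random.random()] for _ in range(8)]
     + [[random.gauss(2, 1), random.gauss(2, 1), random.random()] for _ in range(3)])
y = [0] * 8 + [1] * 3

# Random candidates replace the evolutionary operators in this sketch.
candidates = [([random.random() < 0.5 for _ in range(3)],
               [random.random() < 0.5 for _ in range(len(X))])
              for _ in range(30)]
scored = [(objectives(X, y, fm, im), (fm, im)) for fm, im in candidates]
front = pareto_front(scored)
for (quality, size), _ in sorted(front, key=lambda s: s[0][1]):
    print(f"retained size = {size:2d}, g-mean = {quality:.3f}")
```

Each entry of the resulting front is a trade-off between a smaller training set and higher classification quality. In a full multi-objective evolutionary algorithm, the random sampling step would be replaced by selection, crossover, and mutation over the same mask encoding.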
Acknowledgments
This work was supported by the Spanish Ministry of Science and Technology under projects TIN-2011-28488, TIN-2012-33856; the Andalusian Research Plans P11-TIC-7765 and P10-TIC-6858; and both the University of Jaén and Caja Rural Provincial de Jaén under project UJA2014/06/15.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Fernández, A., del Jesus, M.J., Herrera, F. (2015). Addressing Overlapping in Classification with Imbalanced Datasets: A First Multi-objective Approach for Feature and Instance Selection. In: Jackowski, K., Burduk, R., Walkowiak, K., Wozniak, M., Yin, H. (eds) Intelligent Data Engineering and Automated Learning – IDEAL 2015. IDEAL 2015. Lecture Notes in Computer Science, vol 9375. Springer, Cham. https://doi.org/10.1007/978-3-319-24834-9_5
DOI: https://doi.org/10.1007/978-3-319-24834-9_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24833-2
Online ISBN: 978-3-319-24834-9
eBook Packages: Computer Science, Computer Science (R0)