Abstract
Although many more complex learning algorithms exist, k-nearest neighbor (k-NN) is still one of the most successful classifiers in real-world applications. One of the ways of scaling up the k-nearest neighbors classifier to deal with huge datasets is instance selection. Due to the constantly growing amount of data in almost any pattern recognition task, we need more efficient instance selection algorithms, which must achieve larger reductions while maintaining the accuracy of the selected subset.
However, most instance selection methods do not work well on class imbalanced problems, since they tend to remove too many instances from the minority class. In this paper we present a way to improve instance selection for class imbalanced problems by allowing the algorithm to select each instance more than once. In this way, the few instances of the minority class can cover larger portions of the space, and the testing error of the standard approach can be matched faster and with fewer instances. No other constraint is imposed on the instance selection method.
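The effect of selecting an instance more than once can be illustrated with a weighted k-NN vote: an instance kept m times behaves like m copies in the selected subset, so it occupies more of the k neighbor slots around its region. The following is a minimal sketch of that idea only, not the authors' algorithm; the function name, the `counts` multiplicity array, and the toy data are illustrative assumptions.

```python
import numpy as np

def knn_predict(X_train, y_train, counts, x, k=3):
    """k-NN majority vote where training instance i appears
    counts[i] times in the selected subset (counts[i] >= 1)."""
    # Expand the subset according to each instance's multiplicity
    X_sel = np.repeat(X_train, counts, axis=0)
    y_sel = np.repeat(y_train, counts)
    # Euclidean distances from the query to every (repeated) instance
    d = np.linalg.norm(X_sel - x, axis=1)
    nearest = np.argsort(d)[:k]
    labels, votes = np.unique(y_sel[nearest], return_counts=True)
    return int(labels[np.argmax(votes)])

# Toy example: two majority points (class 0), one minority point (class 1)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]])
y = np.array([0, 0, 1])
query = np.array([0.5, 0.6])

# Standard selection (each instance once): majority class wins the 3-NN vote
print(knn_predict(X, y, np.array([1, 1, 1]), query))  # -> 0
# Minority instance selected twice: it fills two neighbor slots and wins
print(knn_predict(X, y, np.array([1, 1, 2]), query))  # -> 1
```

The toy run shows how a duplicated minority instance can flip a k-NN decision in its neighborhood without adding any new point to the space.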
An extensive comparison on 40 datasets from the UCI Machine Learning Repository shows the usefulness of our approach compared with the established method of evolutionary instance selection. In the worst case, our method matches the error of standard instance selection while achieving a larger reduction and a shorter execution time.
This work was supported in part by the Project TIN2008-03151 of the Spanish Ministry of Science and Innovation and the project P09-TIC-4623 of the Junta de Andalucía.
© 2011 Springer-Verlag Berlin Heidelberg
Pérez-Rodríguez, J., de Haro-García, A., García-Pedrajas, N. (2011). Instance Selection for Class Imbalanced Problems by Means of Selecting Instances More than Once. In: Lozano, J.A., Gámez, J.A., Moreno, J.A. (eds) Advances in Artificial Intelligence. CAEPIA 2011. Lecture Notes in Computer Science(), vol 7023. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25274-7_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25273-0
Online ISBN: 978-3-642-25274-7