Addressing data complexity for imbalanced data sets: analysis of SMOTE-based oversampling and evolutionary undersampling

Abstract

In the classification framework, there are problems in which the number of examples per class is not evenly distributed; these are known as imbalanced data sets. This situation is a handicap when trying to identify the minority classes, as learning algorithms are not usually adapted to such characteristics. A common approach to dealing with imbalanced data sets is the use of a preprocessing step. In this paper we analyze the usefulness of data complexity measures for evaluating the behavior of undersampling and oversampling methods. Two classical learning methods, C4.5 and PART, are considered over a wide range of imbalanced data sets built from real data. Specifically, oversampling techniques and an evolutionary undersampling method have been selected for the study. We extract behavior patterns from the results in the data complexity space defined by the measures, coding them as intervals. Then we derive rules from the intervals that describe both good and bad behaviors of C4.5 and PART for the different preprocessing approaches, thus obtaining a complete characterization of the data sets and of the differences between the oversampling and undersampling results.
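
As background for the oversampling side of the analysis, SMOTE generates synthetic minority examples by interpolating between a minority instance and one of its k nearest minority-class neighbours. The sketch below illustrates that core idea for numeric features; the function name, parameters and sampling details are illustrative assumptions, not the KEEL-based implementation used in the paper.

```python
import math
import random


def smote(minority, k=5, n_synthetic=100, seed=0):
    """Minimal SMOTE-style oversampling sketch (numeric features only).

    Each synthetic example lies on the segment between a minority instance
    and one of its k nearest minority-class neighbours.
    """
    rng = random.Random(seed)

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen instance (excluding itself).
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: dist(base, p))[:k]
        neigh = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neigh)])
    return synthetic


# Toy usage: oversample a tiny 2-D minority class with 5 synthetic points.
minority_class = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.3], [1.1, 2.1]]
new_samples = smote(minority_class, k=2, n_synthetic=5)
```

SMOTE-ENN, also considered in the study, additionally applies Wilson's Edited Nearest Neighbour rule after oversampling to remove examples that disagree with their neighbourhood.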

Notes

  1. http://keel.es.

References

  • Alcalá-Fdez J, Sánchez L, García S, del Jesus MJ, Ventura S, Garrell JM, Otero J, Romero C, Bacardit J, Rivas VM, Fernández JC, Herrera F (2009) KEEL: a software tool to assess evolutionary algorithms for data mining problems. Soft Comput 13(3):307–318

  • Asuncion A, Newman D (2007) UCI machine learning repository. http://www.ics.uci.edu/~mlearn/MLRepository.html

  • Barandela R, Sánchez JS, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognit 36(3):849–851

  • Basu M, Ho TK (2006) Data complexity in pattern recognition (advanced information and knowledge processing). Springer-Verlag New York, Inc., Secaucus

  • Batista GEAPA, Prati RC, Monard MC (2004) A study of the behaviour of several methods for balancing machine learning training data. SIGKDD Explor 6(1):20–29

  • Baumgartner R, Somorjai RL (2006) Data complexity assessment in undersampled classification of high-dimensional biomedical data. Pattern Recognit Lett 12:1383–1389

  • Bernadó-Mansilla E, Ho TK (2005) Domain of competence of XCS classifier system in complexity measurement space. IEEE Trans Evol Comput 9(1):82–104

  • Bradley AP (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit 30(7):1145–1159

  • Brazdil P, Giraud-Carrier C, Soares C, Vilalta R (2009) Metalearning: applications to data mining. Cognitive Technologies, Springer

  • Celebi M, Kingravi H, Uddin B, Iyatomi H, Aslandogan Y, Stoecker W, Moss R (2007) A methodological approach to the classification of dermoscopy images. Comput Med Imaging Graphics 31(6):362–373

  • Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) Smote: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357

  • Chawla NV, Japkowicz N, Kolcz A (2004) Editorial: special issue on learning from imbalanced data sets. SIGKDD Explor 6(1):1–6

  • Diamantini C, Potena D (2009) Bayes vector quantizer for class-imbalance problem. IEEE Trans Knowl Data Eng 21(5):638–651

  • Domingos P (1999) MetaCost: a general method for making classifiers cost-sensitive. In: Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining (KDD '99), pp 155–164

  • Dong M, Kothari R (2003) Feature subset selection using a new definition of classificabilty. Pattern Recognit Lett 24:1215–1225

  • Drown DJ, Khoshgoftaar TM, Seliya N (2009) Evolutionary sampling and software quality modeling of high-assurance systems. IEEE Trans Syst Man Cybern A 39(5):1097–1107

  • Eshelman LJ (1991) The CHC adaptive search algorithm: how to have safe search when engaging in nontraditional genetic recombination. In: Foundations of genetic algorithms, pp 265–283

  • Estabrooks A, Jo T, Japkowicz N (2004) A multiple resampling method for learning from imbalanced data sets. Comput Intell 20(1):18–36

  • Fernández A, García S, del Jesus MJ, Herrera F (2008) A study of the behaviour of linguistic fuzzy rule based classification systems in the framework of imbalanced data-sets. Fuzzy Sets Syst 159(18):2378–2398

  • Frank E, Witten IH (1998) Generating accurate rule sets without global optimization. In: ICML ’98: Proceedings of the fifteenth international conference on machine learning, Morgan Kaufmann Publishers Inc., San Francisco, pp 144–151

  • García S, Herrera F (2009a) Evolutionary undersampling for classification with imbalanced datasets: proposals and taxonomy. Evol Comput 17(3):275–306

  • García S, Fernández A, Herrera F (2009b) Enhancing the effectiveness and interpretability of decision tree and rule induction classifiers with evolutionary training set selection over imbalanced problems. Appl Soft Comput 9(4):1304–1314

  • García S, Cano JR, Bernadó-Mansilla E, Herrera F (2009c) Diagnose of effective evolutionary prototype selection using an overlapping measure. Int J Pattern Recognit Artif Intell 23(8):2378–2398

  • García V, Mollineda R, Sánchez JS (2008) On the k–NN performance in a challenging scenario of imbalance and overlapping. Pattern Anal Appl 11(3–4):269–280

  • He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284

  • Ho TK, Basu M (2002) Complexity measures of supervised classification problems. IEEE Trans Pattern Anal Mach Intell 24(3):289–300

  • Hoekstra A, Duin RP (1996) On the nonlinearity of pattern classifiers. In: ICPR ’96: Proceedings of the international conference on pattern recognition (ICPR ’96) Volume IV-Volume 7472, IEEE Computer Society, Washington, DC, pp 271–275

  • Huang J, Ling CX (2005) Using AUC and accuracy in evaluating learning algorithms. IEEE Trans Knowl Data Eng 17(3):299–310

  • Kalousis A (2002) Algorithm selection via meta-learning. PhD thesis, Université de Geneve

  • Kilic K, Uncu O, Türksen IB (2007) Comparison of different strategies of utilizing fuzzy clustering in structure identification. Inform Sci 177(23):5153–5162

  • Kim SW, Oommen BJ (2009) On using prototype reduction schemes to enhance the computation of volume-based inter-class overlap measures. Pattern Recognit 42(11):2695–2704

  • Li Y, Member S, Dong M, Kothari R, Member S (2005) Classifiability-based omnivariate decision trees. IEEE Trans Neural Netw 16(6):1547–1560

  • Lu WZ, Wang D (2008) Ground-level ozone prediction by support vector machine approach with a cost-sensitive classification scheme. Sci Total Environ 395(2–3):109–116

  • Luengo J, Herrera F (2010) Domains of competence of fuzzy rule based classification systems with data complexity measures: A case of study using a fuzzy hybrid genetic based machine learning method. Fuzzy Sets Syst 161(1):3–19

  • Mazurowski M, Habas P, Zurada J, Lo J, Baker J, Tourassi G (2008) Training neural network classifiers for medical decision making: the effects of imbalanced datasets on classification performance. Neural Netw 21(2–3):427–436

  • Mollineda RA, Sánchez JS, Sotoca JM (2005) Data characterization for effective prototype selection. In: First edition of the Iberian conference on pattern recognition and image analysis (IbPRIA 2005), Lecture Notes in Computer Science 3523, pp 27–34

  • Orriols-Puig A, Bernadó-Mansilla E (2008) Evolutionary rule-based systems for imbalanced data sets. Soft Comput 13(3):213–225

  • Peng X, King I (2008) Robust BMPM training based on second-order cone programming and its application in medical diagnosis. Neural Netw 21(2–3):450–457

  • Pfahringer B, Bensusan H, Giraud-Carrier CG (2000) Meta-learning by landmarking various learning algorithms. In: ICML ’00: Proceedings of the seventeenth international conference on machine learning, Morgan Kaufmann Publishers Inc., San Francisco, pp 743–750

  • Quinlan JR (1993) C4.5: Programs for machine learning. Morgan Kaufmann Publishers, San Mateo–California

  • Sánchez J, Mollineda R, Sotoca J (2007) An analysis of how training data complexity affects the nearest neighbor classifiers. Pattern Anal Appl 10(3):189–201

  • Singh S (2003) Multiresolution estimates of classification complexity. IEEE Trans Pattern Anal Mach Intell 25(12):1534–1539

  • Su CT, Hsiao YH (2007) An evaluation of the robustness of MTS for imbalanced data. IEEE Trans Knowl Data Eng 19(10):1321–1332

  • Sun Y, Kamel MS, Wong AKC, Wang Y (2007) Cost–sensitive boosting for classification of imbalanced data. Pattern Recognit 40:3358–3378

  • Sun Y, Wong AKC, Kamel MS (2009) Classification of imbalanced data: A review. Int J Pattern Recognit Artif Intell 23(4):687–719

  • Tang Y, Zhang YQ, Chawla N (2009) SVMs modeling for highly imbalanced classification. IEEE Trans Syst Man Cybern B Cybern 39(1):281–288

  • Williams D, Myers V, Silvious M (2009) Mine classification with imbalanced data. IEEE Geosci Remote Sens Lett 6(3):528–532

  • Yang Q, Wu X (2006) 10 challenging problems in data mining research. Int J Inform Tech Decis Mak 5(4):597–604

  • Zhou ZH, Liu XY (2006) Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Trans Knowl Data Eng 18(1):63–77

Acknowledgments

This work has been supported by the Spanish Ministry of Education and Science under Project TIN2008-06681-C06-(01 and 02). J. Luengo holds an FPU scholarship from the Spanish Ministry of Education.

Author information

Corresponding author

Correspondence to Julián Luengo.

Appendices

Appendix 1: Figures with the intervals of PART and C4.5

In this appendix, the figures sorted by the F1, N4 and L3 data complexity measures are depicted. A two-column layout is used, so each row presents the results of C4.5 and PART for the same combination of preprocessing method and data complexity measure. A brief sketch of how the F1 measure is computed follows the list below.

  • Figures 13, 14, 15, 16, 17 and 18 show the results for SMOTE preprocessing.

  • Figures 19, 20, 21, 22, 23 and 24 show the results for SMOTE-ENN preprocessing.

  • Figures 25, 26, 27, 28, 29 and 30 show the results for EUSCHC preprocessing.
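
The F1 measure used to index several of the figures above is the maximum Fisher's discriminant ratio over the individual input features (Ho and Basu 2002); N4 and L3 are, respectively, the nonlinearity of the one-nearest-neighbour classifier and of a linear classifier, and are not sketched here. The following is a minimal sketch of the F1 computation for a two-class problem; the function name and plain-list data layout are assumptions for illustration, not the tooling used in the study.

```python
def fisher_ratio_f1(class_a, class_b):
    """Maximum Fisher's discriminant ratio (F1) over all features.

    class_a, class_b: lists of numeric feature vectors, one list per class.
    High F1 means at least one feature separates the classes well; values
    near zero indicate strong overlap on every individual feature.
    """
    def mean_var(values):
        m = sum(values) / len(values)
        return m, sum((x - m) ** 2 for x in values) / len(values)

    n_features = len(class_a[0])
    best = 0.0
    for j in range(n_features):
        mean_a, var_a = mean_var([x[j] for x in class_a])
        mean_b, var_b = mean_var([x[j] for x in class_b])
        if var_a + var_b == 0.0:
            # Zero within-class variance: the feature is either constant overall
            # or separates the classes perfectly; skipped in this simple sketch.
            continue
        best = max(best, (mean_a - mean_b) ** 2 / (var_a + var_b))
    return best


# Toy usage: two 2-D classes that are well separated on the first feature.
minority = [[3.0, 5.0], [3.1, 4.9], [2.9, 5.2]]
majority = [[0.0, 5.0], [0.2, 4.8], [-0.1, 5.1]]
print(fisher_ratio_f1(minority, majority))  # large value: low overlap on feature 0
```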

Fig. 13: C4.5 with SMOTE, AUC in training/test sorted by F1
Fig. 14: PART with SMOTE, AUC in training/test sorted by F1
Fig. 15: C4.5 with SMOTE, AUC in training/test sorted by N4
Fig. 16: PART with SMOTE, AUC in training/test sorted by N4
Fig. 17: C4.5 with SMOTE, AUC in training/test sorted by L3
Fig. 18: PART with SMOTE, AUC in training/test sorted by L3
Fig. 19: C4.5 with SMOTE-ENN, AUC in training/test sorted by F1
Fig. 20: PART with SMOTE-ENN, AUC in training/test sorted by F1
Fig. 21: C4.5 with SMOTE-ENN, AUC in training/test sorted by N4
Fig. 22: PART with SMOTE-ENN, AUC in training/test sorted by N4
Fig. 23: C4.5 with SMOTE-ENN, AUC in training/test sorted by L3
Fig. 24: PART with SMOTE-ENN, AUC in training/test sorted by L3
Fig. 25: C4.5 with EUSCHC, AUC in training/test sorted by F1
Fig. 26: PART with EUSCHC, AUC in training/test sorted by F1
Fig. 27: C4.5 with EUSCHC, AUC in training/test sorted by N4
Fig. 28: PART with EUSCHC, AUC in training/test sorted by N4
Fig. 29: C4.5 with EUSCHC, AUC in training/test sorted by L3
Fig. 30: PART with EUSCHC, AUC in training/test sorted by L3

Appendix 2: Tables of results

In this appendix we present the average AUC results for C4.5 and PART in Tables 15 and 16, respectively.

Table 15 Average AUC results for C4.5
Table 16 Average AUC results for PART
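
The values in these tables are areas under the ROC curve (AUC). As a generic reference for the metric itself, the sketch below computes the AUC of a binary (minority vs. majority) problem from classifier scores using the rank-based (Mann-Whitney) formulation; this is an illustrative assumption about how such values can be obtained, not necessarily the exact procedure used to produce the paper's results.

```python
def roc_auc(scores, labels):
    """AUC via the rank-based (Mann-Whitney U) formulation.

    scores: classifier scores, higher meaning "more likely minority class".
    labels: 0/1 class labels, 1 denoting the minority (positive) class.
    """
    pairs = sorted(zip(scores, labels))          # ascending by score
    ranks = [0.0] * len(pairs)
    i = 0
    while i < len(pairs):                        # average ranks for tied scores
        j = i
        while j + 1 < len(pairs) and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2.0 + 1.0       # 1-based average rank
        i = j + 1
    n_pos = sum(lab for _, lab in pairs)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present to compute AUC")
    rank_sum_pos = sum(r for r, (_, lab) in zip(ranks, pairs) if lab == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# Toy usage: 2 minority and 6 majority instances, perfectly ranked.
print(roc_auc([0.9, 0.4, 0.8, 0.3, 0.2, 0.7, 0.1, 0.6],
              [1, 0, 1, 0, 0, 0, 0, 0]))  # 1.0
```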

Cite this article

Luengo, J., Fernández, A., García, S. et al. Addressing data complexity for imbalanced data sets: analysis of SMOTE-based oversampling and evolutionary undersampling. Soft Comput 15, 1909–1936 (2011). https://doi.org/10.1007/s00500-010-0625-8

