Skip to main content
Log in

On the joint-effect of class imbalance and overlap: a critical review

  • Published:
Artificial Intelligence Review Aims and scope Submit manuscript

Abstract

Current research on imbalanced data recognises that class imbalance is aggravated by other data intrinsic characteristics, among which class overlap stands out as one of the most harmful. The combination of these two problems creates a new and difficult scenario for classification tasks and has been discussed in several research works over the past two decades. In this paper, we argue that despite some insightful information can be derived from related research, the joint-effect of class overlap and imbalance is still not fully understood, and advocate for the need to move towards a unified view of the class overlap problem in imbalanced domains. To that end, we start by performing a thorough analysis of existing literature on the joint-effect of class imbalance and overlap, elaborating on important details left undiscussed on the original papers, namely the impact of data domains with different characteristics and the behaviour of classifiers with distinct learning biases. This leads to the hypothesis that class overlap comprises multiple representations, which are important to accurately measure and analyse in order to provide a full characterisation of the problem. Accordingly, we devise two novel taxonomies, one for class overlap measures and the other for class overlap-based approaches, both resonating with the distinct representations of class overlap identified. This paper therefore presents a global and unique view on the joint-effect of class imbalance and overlap, from precursor work to recent developments in the field. It meticulously discusses some concepts taken as implicit in previous research, explores new perspectives in light of the limitations found, and presents new ideas that will hopefully inspire researchers to move towards a unified view on the problem and the development of suitable strategies for imbalanced and overlapped domains.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15
Fig. 16
Fig. 17
Fig. 18
Fig. 19
Fig. 20
Fig. 21
Fig. 22
Fig. 23
Fig. 24
Fig. 25
Fig. 26

Similar content being viewed by others

Notes

  1. The reader may find supporting information in the supplementary material online at https://student.dei.uc.pt/~miriams/pdf-files/AIR_2021_Appendix.pdf.

  2. The interested reader may find detailed information on the performance of each classifier in the supplementary material provided online at https://student.dei.uc.pt/~miriams/pdf-files/AIR_2021_Appendix.pdf.

  3. https://github.com/miriamspsantos/pycol.

  4. https://archive.ics.uci.edu.

  5. https://www.kaggle.com.

  6. http://keel.es.

  7. https://www.openml.org.

  8. https://github.com/miriamspsantos/datagenerator.

  9. https://github.com/miriamspsantos/open-source-imbalance-overlap.

  10. https://github.com/miriamspsantos/pycol.

References

  • Abdi L, Hashemi S (2015) To combat multi-class imbalanced problems by means of over-sampling techniques. IEEE Trans Knowl Data Eng 28(1):238–251

    Article  Google Scholar 

  • Akbani R, Kwek S, Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: European conference on machine learning. Springer, pp 39–50

  • Alejo R, Valdovinos RM, García V, Pacheco-Sanchez JH (2013) A hybrid method to face class overlap and class imbalance on neural networks and multi-class scenarios. Pattern Recogn Lett 34(4):380–388

    Article  Google Scholar 

  • Anwar N, Jones G, Ganesh S (2014) Measurement of data complexity for classification problems with unbalanced data. Stat Anal Data Min ASA Data Sci J 7(3):194–211

    Article  MathSciNet  MATH  Google Scholar 

  • Armano G, Tamponi E (2016) Experimenting multiresolution analysis for identifying regions of different classification complexity. Pattern Anal Appl 19(1):129–137

    Article  MathSciNet  Google Scholar 

  • Barandela R, Valdovinos RM, Sánchez JS (2003) New applications of ensembles of classifiers. Pattern Anal Appl 6(3):245–256

    Article  MathSciNet  Google Scholar 

  • Barella VH, Costa EP, Carvalho A, Pl F (2014) Clusteross: a new undersampling method for imbalanced learning. In: Proceedings of the 3th Brazilian conference on intelligent systems. Academic Press

  • Barella VH, Garcia LP, de Souto MP, Lorena AC, de Carvalho A (2018) Data complexity measures for imbalanced classification tasks. In: 2018 international joint conference on neural networks (IJCNN). IEEE, pp 1–8

  • Barella VH, Garcia LP, de Souto MC, Lorena AC, de Carvalho AC (2021) Assessing the data complexity of imbalanced datasets. Inf Sci 553:83–109

    Article  MathSciNet  MATH  Google Scholar 

  • Barua S, Islam M, Yao X, Murase K (2014) Mwmote-majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans Knowl Data Eng 26(2):405–425

    Article  Google Scholar 

  • Batista GE, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor Newsl 6(1):20–29

    Article  Google Scholar 

  • Batuwita R, Palade V (2010) Fsvm-cil: fuzzy support vector machines for class imbalance learning. IEEE Trans Fuzzy Syst 18(3):558–571

    Article  Google Scholar 

  • Bi J, Zhang C (2018) An empirical comparison on state-of-the-art multi-class imbalance learning algorithms and a new diversified ensemble learning scheme. Knowl Based Syst 158:81–93

    Article  Google Scholar 

  • Borsos Z, Lemnaru C, Potolea R (2018) Dealing with overlap and imbalance: a new metric and approach. Pattern Anal Appl 21(2):381–395

    Article  MathSciNet  Google Scholar 

  • Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140

    Article  MATH  Google Scholar 

  • Bunkhumpornpat C, Sinapiromsaran K (2017) Dbmute: density-based majority under-sampling technique. Knowl Inf Syst 50(3):827–850

    Article  Google Scholar 

  • Bunkhumpornpat C, Sinapiromsaran K, Lursinsap C (2009) Safe-level-smote: safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, pp 475–482

  • Bunkhumpornpat C, Sinapiromsaran K, Lursinsap C (2011) Mute: majority under-sampling technique. In: 2011 8th international conference on information, communications and signal processing. IEEE, pp 1–4

  • Bunkhumpornpat C, Sinapiromsaran K, Lursinsap C (2012) Dbsmote: density-based synthetic minority over-sampling technique. Appl Intell 36(3):664–684

    Article  Google Scholar 

  • Cao H, Li XL, Woon DYK, Ng SK (2013) Integrated oversampling for imbalanced time series classification. IEEE Trans Knowl Data Eng 25(12):2809–2822

    Article  Google Scholar 

  • Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) Smote: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357

    Article  MATH  Google Scholar 

  • Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) Smoteboost: improving prediction of the minority class in boosting. In: European conference on principles of data mining and knowledge discovery. Springer, pp 107–119

  • Chen S (2017) An improved synthetic minority over-sampling technique for imbalanced data set learning. Degree thesis of Department of Information Engineering, National Tsing Hua University, pp 1–59

  • Chen S, He H, Garcia EA (2010) Ramoboost: ranked minority oversampling in boosting. IEEE Trans Neural Netw 21(10):1624–1642

    Article  Google Scholar 

  • Chen L, Fang B, Shang Z, Tang Y (2018) Tackling class overlap and imbalance problems in software defect prediction. Softw Qual J 26(1):97–125

    Article  Google Scholar 

  • Chen X, Zhang L, Wei X, Lu X (2021) An effective method using clustering-based adaptive decomposition and editing-based diversified oversamping for multi-class imbalanced datasets. Appl Intell 51(4):1918–1933

  • Cieslak DA, Chawla NV, Striegel A (2006) Combating imbalance in network intrusion datasets. In: GrC, Citeseer, pp 732–737

  • Cohen G, Hilario M, Sax H, Hugonnet S, Geissbuhler A (2006) Learning from imbalanced data in surveillance of nosocomial infection. Artif Intell Med 37(1):7–18

    Article  Google Scholar 

  • Correia A, Soares C, Jorge A (2019) Dataset morphing to analyze the performance of collaborative filtering. In: International conference on discovery science. Springer, pp 29–39

  • Costa AJ, Santos MS, Soares C, Abreu PH (2020) Analysis of imbalance strategies recommendation using a meta-learning approach. In: 7th ICML workshop on automated machine learning (AutoML-ICML2020), pp 1–10

  • Cummins L (2013) Combining and choosing case base maintenance algorithms. PhD thesis, University College Cork

  • Das B, Krishnan NC, Cook DJ (2014a) Handling imbalanced and overlapping classes in smart environments prompting dataset. In: Data mining for service. Springer, pp 199–219

  • Das B, Krishnan NC, Cook DJ (2014b) Racog and wracog: two probabilistic oversampling techniques. IEEE Trans Knowl Data Eng 27(1):222–234

    Article  Google Scholar 

  • Das S, Datta S, Chaudhuri B (2018) Handling data irregularities in classification: foundations, trends, and future challenges. Pattern Recogn 81:674–693

    Article  Google Scholar 

  • de Melo VV, Lorena AC (2018) Using complexity measures to evolve synthetic classification datasets. In: 2018 International joint conference on neural networks (IJCNN). IEEE, pp 1–8

  • Deb K, Pratap A, Agarwal S, Meyarivan T (2002) A fast and elitist multiobjective genetic algorithm: Nsga-ii. IEEE Trans Evol Comput 6(2):182–197

    Article  Google Scholar 

  • Denil M, Trappenberg T (2010) Overlap versus imbalance. In: Canadian conference on artificial intelligence. Springer, pp 220–231

  • Douzas G, Bacao F (2019) Geometric smote a geometrically enhanced drop-in replacement for smote. Inf Sci 501:118–135

    Article  Google Scholar 

  • Douzas G, Bacao F, Last F (2018) Improving imbalanced learning through a heuristic oversampling method based on k-means and smote. Inf Sci 465:1–20

    Article  Google Scholar 

  • Eshelman LJ (1991) The chc adaptive search algorithm: How to have safe search when engaging in nontraditional genetic recombination. In: Foundations of genetic algorithms, vol 1. Elsevier, pp 265–283

  • Ester M, Kriegel HP, Sander J, Xu X et al (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. Kdd 96:226–231

    Google Scholar 

  • Fan Q, Wang Z, Li D, Gao D, Zha H (2017) Entropy-based fuzzy support vector machine for imbalanced datasets. Knowl Based Syst 115:87–99

    Article  Google Scholar 

  • Fernandes ER, de Carvalho AC (2019) Evolutionary inversion of class distribution in overlapping areas for multi-class imbalanced learning. Inf Sci 494:141–154

    Article  Google Scholar 

  • Fernández A, García S, Galar M, Prati R, Krawczyk B, Herrera F (2018a) Data Intrinsic Characteristics. Springer, Cham, pp 253–277

    Google Scholar 

  • Fernández A, García S, Galar M, Prati R, Krawczyk B, Herrera F (2018b) Ensemble Learning. Springer, Cham, pp 147–196

    Google Scholar 

  • Fernández A, García S, Galar M, Prati RC, Krawczyk B, Herrera F (2018c) Dimensionality reduction for imbalanced learning. In: Learning from imbalanced data sets. Springer, pp 227–251

  • Fernández A, García S, Galar M, Prati RC, Krawczyk B, Herrera F (2018d) Learning From Imbalanced Data Sets, vol 11. Springer, Berlin

    Book  Google Scholar 

  • Fernández A, Garcia S, Herrera F, Chawla NV (2018e) Smote for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. J Artif Intell Res 61:863–905

    Article  MathSciNet  MATH  Google Scholar 

  • França TR, Miranda PB, Prudêncio RB, Lorenaz AC, Nascimento AC (2020) A many-objective optimization approach for complexity-based data set generation. In: 2020 IEEE congress on evolutionary computation (CEC). IEEE, pp 1–8

  • Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 55(1):119–139

    Article  MathSciNet  MATH  Google Scholar 

  • Friedman J, Hastie T, Tibshirani R et al (2000) Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). Ann Stat 28(2):337–407

    Article  MATH  Google Scholar 

  • Fu GH, Wu YJ, Zong MJ, Yi LZ (2020) Feature selection and classification by minimizing overlap degree for class-imbalanced data in metabolomics. Chemom Intell Lab Syst 196:103906

    Article  Google Scholar 

  • Galar M, Fernández A, Barrenechea E, Bustince H, Herrera F (2013) Dynamic classifier selection for one-vs-one strategy: avoiding non-competent classifiers. Pattern Recogn 46(12):3412–3424

    Article  Google Scholar 

  • Galar M, Fernández A, Barrenechea E, Herrera F (2015) Drcw-ovo: distance-based relative competence weighting combination for one-vs-one strategy in multi-class problems. Pattern Recogn 48(1):28–42

    Article  Google Scholar 

  • García S, Herrera F (2009) Evolutionary undersampling for classification with imbalanced datasets: proposals and taxonomy. Evol Comput 17(3):275–306

    Article  MathSciNet  Google Scholar 

  • García V, Alejo R, Sánchez J, Sotoca J, Mollineda R (2006) Combined effects of class imbalance and class overlap on instance-based classification. In: International conference on intelligent data engineering and automated learning. Springer, pp 371–378

  • García V, Mollineda R, Sánchez J, Alejo R, Sotoca J (2007a) When overlapping unexpectedly alters the class imbalance effects. In: Iberian conference on pattern recognition and image analysis. Springer, pp 499–506

  • García V, Sánchez J, Mollineda R (2007b) An empirical study of the behavior of classifiers on imbalanced and overlapped data sets. In: Iberoamerican congress on pattern recognition. Springer, pp 397–406

  • García V, Mollineda R, Sánchez J (2008) On the k-nn performance in a challenging scenario of imbalance and overlapping. Pattern Anal Appl 11(3–4):269–280

    Article  MathSciNet  Google Scholar 

  • García V, Sánchez J, Marqués A, Florencia R, Rivera G (2020) Understanding the apparent superiority of over-sampling through an analysis of local information for class-imbalanced data. Expert Syst Appl 158:113026

    Article  Google Scholar 

  • Greene J (2001) Feature subset selection using thornton’s separability index and its applicability to a number of sparse proximity-based classifiers. In: Proceedings of annual symposium of the pattern recognition association of South Africa

  • Guzmán-Ponce A, Valdovinos RM, Sánchez JS, Marcial-Romero JR (2020) A new under-sampling method to face class overlap and imbalance. Appl Sci 10(15):5164

    Article  Google Scholar 

  • Haixiang G, Yijing L, Shang J, Mingyun G, Yuanyue H, Bing G (2017) Learning from class-imbalanced data: review of methods and applications. Expert Syst Appl 73:220–239

    Article  Google Scholar 

  • Han H, Wang WY, Mao BH (2005) Borderline-smote: a new over-sampling method in imbalanced data sets learning. In: International conference on intelligent computing. Springer, pp 878–887

  • Hart P (1968) The condensed nearest neighbor rule (corresp.). IEEE Trans Inf Theory 14(3):515–516

    Article  Google Scholar 

  • He H, Bai Y, Garcia E, Li S (2008) Adasyn: adaptive synthetic sampling approach for imbalanced learning. In: IEEE international joint conference on neural networks, 2008. IJCNN 2008. (IEEE World Congress on Computational Intelligence). IEEE, pp 1322–1328

  • Ho T, Basu M (2002) Complexity measures of supervised classification problems. IEEE Trans Pattern Anal Mach Intell 24(3):289–300

    Article  Google Scholar 

  • Huttenlocher DP, Klanderman GA, Rucklidge WJ (1993) Comparing images using the hausdorff distance. IEEE Trans Pattern Anal Mach Intell 15(9):850–863

    Article  Google Scholar 

  • Jain A, Duin R, Mao J (2000) Statistical pattern recognition: a review. IEEE Trans Pattern Anal Mach Intell 22(1):4–37

    Article  Google Scholar 

  • Japkowicz N (2001) Concept-learning in the presence of between-class and within-class imbalances. In: Conference of the Canadian society for computational studies of intelligence. Springer, pp 67–77

  • Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40–49

    Article  Google Scholar 

  • Kang S, Cho S, Kang P (2015) Constructing a multi-class classifier using one-against-one approach with different binary classifiers. Neurocomputing 149:677–682

    Article  Google Scholar 

  • Kaur H, Pannu HS, Malhi AK (2019) A systematic review on imbalanced data challenges in machine learning: applications and solutions. ACM Comput Surv (CSUR) 52(4):1–36

    Google Scholar 

  • Kovács G (2019) An empirical comparison and evaluation of minority oversampling techniques on a large number of imbalanced datasets. Appl Soft Comput 83:105662

    Article  Google Scholar 

  • Koziarski M, Wozniak M (2017) Ccr: a combined cleaning and resampling algorithm for imbalanced data classification. Int J Appl Math Comput Sci 27(4):727–736

    Article  MathSciNet  MATH  Google Scholar 

  • Koziarski M, Krawczyk B, Wozniak M (2019) Radial-based oversampling for noisy imbalanced data classification. Neurocomputing 343:19–33

    Article  Google Scholar 

  • Krawczyk B (2016) Learning from imbalanced data: open challenges and future directions. Progr. Artif. Intell. 5(4):221–232

    Article  Google Scholar 

  • Kubat M, Matwin S et al (1997) Addressing the curse of imbalanced training sets: one-sided selection. Icml Citeseer 97:179–186

    Google Scholar 

  • Lango M, Brzezinski D, Firlik S, Stefanowski J (2017) Discovering minority sub-clusters and local difficulty factors from imbalanced data. In: International conference on discovery science. Springer, pp 324–339

  • Lango M, Brzezinski D, Stefanowski J (2018) Imweights: classifying imbalanced data using local and neighborhood information. In: Second international workshop on learning with imbalanced domains: theory and applications, PMLR, pp 95–109

  • Laurikkala J (2001) Improving identification of difficult small classes by balancing class distribution. In: Conference on artificial intelligence in medicine in Europe. Springer, pp 63–66

  • Lee HK, Kim SB (2018) An overlap-sensitive margin classifier for imbalanced and overlapping data. Expert Syst Appl 98:72–83

    Article  Google Scholar 

  • Leyva E, González A, Perez R (2014) A set of complexity measures designed for applying meta-learning to instance selection. IEEE Trans Knowl Data Eng 27(2):354–367

    Article  Google Scholar 

  • Li KS, Wang HR, Liu KH (2019) A novel error-correcting output codes algorithm based on genetic programming. Swarm Evol Comput 50:100564

    Article  Google Scholar 

  • Liu C (2008) Partial discriminative training for classification of overlapping classes in document analysis. IJDAR 11(2):53

    Article  Google Scholar 

  • Liu XY, Wu J, Zhou ZH (2008) Exploratory undersampling for class-imbalance learning. IEEE Trans Syst Man Cybern Part B (Cybern) 39(2):539–550

    Google Scholar 

  • López V, Fernández A, García S, Palade V, Herrera F (2013) An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf Sci 250:113–141

    Article  Google Scholar 

  • Lorena AC, Costa IG, Spolaôr N, De Souto MC (2012) Analysis of complexity indices for classification problems: cancer gene expression data. Neurocomputing 75(1):33–42

    Article  Google Scholar 

  • Lorena AC, Garcia LP, Lehmann J, Souto MC, Ho TK (2019) How complex is your classification problem? A survey on measuring classification complexity. ACM Comput Surv (CSUR) 52(5):1–34

    Article  Google Scholar 

  • Luengo J, Fernández A, García S, Herrera F (2011) Addressing data complexity for imbalanced data sets: analysis of smote-based oversampling and evolutionary undersampling. Soft Comput 15(10):1909–1936

    Article  Google Scholar 

  • MacCuish J, MacCuish N (2010) Clustering in Bioinformatics and Drug Discovery. CRC Press, London

    Book  MATH  Google Scholar 

  • Macià N, Bernadó-Mansilla E (2014) Towards uci+: a mindful repository design. Inf Sci 261:237–262

    Article  Google Scholar 

  • Malina W (2001) Two-parameter fisher criterion. IEEE Trans Syst Man Cybern Part B (Cybern) 31(4):629–636

    Article  Google Scholar 

  • Mani I, Zhang I (2003) knn approach to unbalanced data distributions: a case study involving information extraction. In: Proceedings of workshop on learning from imbalanced datasets, ICML United States, vol 126

  • Manukyan A, Ceyhan E (2016) Classification of imbalanced data with a geometric digraph family. J Mach Learn Res 17(1):6504–6543

    MathSciNet  MATH  Google Scholar 

  • Massie S, Craw S, Wiratunga N (2005) Complexity-guided case discovery for case based reasoning. AAAI 5:216–221

    Google Scholar 

  • Menzies T, Butcher A, Cok D, Marcus A, Layman L, Shull F, Turhan B, Zimmermann T (2012) Local versus global lessons for defect prediction and effort estimation. IEEE Trans Softw Eng 39(6):822–834

    Article  Google Scholar 

  • Mercier M, Santos M, Abreu P, Soares C, Soares J, Santos J (2018) Analysing the footprint of classifiers in overlapped and imbalanced contexts. In: International symposium on intelligent data analysis. Springer, pp 200–212

  • Muñoz MA, Villanova L, Baatar D, Smith-Miles K (2018) Instance spaces for machine learning classification. Mach Learn 107(1):109–147

    Article  MathSciNet  MATH  Google Scholar 

  • Napierala K, Stefanowski J (2016) Types of minority class examples and their influence on learning classifiers from imbalanced data. J Intell Inf Syst 46(3):563–597

    Article  Google Scholar 

  • Napierała K, Stefanowski J, Wilk S (2010) Learning from imbalanced data in presence of noisy and borderline examples. In: International conference on rough sets and current trends in computing. Springer, pp 158–167

  • Nekooeimehr I, Lai-Yuen SK (2016) Adaptive semi-unsupervised weighted oversampling (a-suwo) for imbalanced datasets. Expert Syst Appl 46:405–416

    Article  Google Scholar 

  • Oh S (2011) A new dataset evaluation method based on category overlap. Comput Biol Med 41(2):115–122

    Article  Google Scholar 

  • Orriols-Puig A, Macia N, Ho TK (2010) Documentation for the data complexity library in c++. Universitat Ramon Llull, La Salle 196:1–40

    Google Scholar 

  • Pascual-Triana JD, Charte D, Andrés Arroyo M, Fernández A, Herrera F (2021) Revisiting data complexity metrics based on morphology for overlap and imbalance: snapshot, new overlap number of balls metrics and singular problems prospect. Knowl Inf Syst 63(7):1961–1989

    Article  Google Scholar 

  • Prati RGB, Monard M (2004) Class imbalances versus class overlapping: an analysis of a learning system behavior. In: Mexican international conference on artificial intelligence. Springer, pp 312–321

  • Rivolli A, Garcia LP, Soares C, Vanschoren J, de Carvalho AC (2018) Characterizing classification datasets: a study of meta-features for meta-learning. arXiv:180810406

  • Sáez J, Luengo J, Stefanowski J, Herrera F (2015) Smote-ipf: addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Inf Sci 291:184–203

    Article  Google Scholar 

  • Sáez JA, Galar M, Krawczyk B (2019) Addressing the overlapping data problem in classification using the one-vs-one decomposition strategy. IEEE Access 7:83396–83411

    Article  Google Scholar 

  • Santos M, Abreu P, García-Laencina P, Simão A, Carvalho A (2015) A new cluster-based oversampling method for improving survival prediction of hepatocellular carcinoma patients. J Biomed Inform 58:49–59

    Article  Google Scholar 

  • Santos M, Soares J, Abreu P, Araújo H, Santos J (2018) Cross-validation for imbalanced datasets: avoiding overoptimistic and overfitting approaches. IEEE Comput Intell Mag 13(3):59–76

    Article  Google Scholar 

  • Santoso B, Wijayanto H, Notodiputro KA, Sartono B (2018) K-neighbor over-sampling with cleaning data: a new approach to improve classification performance in data sets with class imbalance. Appl Math Sci 12(10):449–460

    Google Scholar 

  • Seiffert C, Khoshgoftaar TM, Van Hulse J, Napolitano A (2009) Rusboost: a hybrid approach to alleviating class imbalance. IEEE Trans Syst Man, Cybern Part A Syst Hum 40(1):185–197

    Article  Google Scholar 

  • Selvaraj G, Kaliamurthi S, Kaushik A, Khan A, Wei Y, Cho W, Gu K, Wei D (2018) Identification of target gene and prognostic evaluation for lung adenocarcinoma using gene expression meta-analysis, network analysis and neural network algorithms. J Biomed Inform 86:120–134

    Article  Google Scholar 

  • Shilaskar S, Ghatol A, Chatur P (2017) Medical decision support system for extremely imbalanced datasets. Inf Sci 384:205–219

    Article  MathSciNet  Google Scholar 

  • Singh S (2003a) Multiresolution estimates of classification complexity. IEEE Trans Pattern Anal Mach Intell 25(12):1534–1539

    Article  Google Scholar 

  • Singh S (2003b) Prism-a novel framework for pattern recognition. Pattern Anal Appl 6(2):134–149

    Article  MathSciNet  Google Scholar 

  • Singh D, Gosain A, Saha A (2020) Weighted k-nearest neighbor based data complexity metrics for imbalanced datasets. Stat Anal Data Min ASA Data Sci J 13(4):394–404

    Article  MathSciNet  MATH  Google Scholar 

  • Slowik A, Kwasnicka H (2020) Evolutionary algorithms and their applications to engineering problems. Neural Comput Appl 32(16):12363–12379

  • Smith MR, Martinez T, Giraud-Carrier C (2014) An instance level analysis of data complexity. Mach Learn 95(2):225–256

    Article  MathSciNet  MATH  Google Scholar 

  • Sotoca JM, Sanchez J, Mollineda RA (2005) A review of data complexity measures and their applicability to pattern classification problems. Actas del III Taller Nacional de Mineria de Datos y Aprendizaje TAMIDA, pp 77–83

  • Sotoca JM, Mollineda RA, Sánchez JS (2006) A meta-learning framework for pattern classication by means of data complexity measures. Inteligencia Artificial Revista Iberoamericana de Inteligencia Artificial 10(29):31–38

    Google Scholar 

  • Sowah RA, Agebure MA, Mills GA, Koumadi KM, Fiawoo SY (2016) New cluster undersampling technique for class imbalance learning. Int J Mach Learn Comput 6(3):205

    Article  Google Scholar 

  • Stefanowski J (2013) Overlapping, rare examples and class decomposition in learning classifiers from imbalanced data. In: Emerging paradigms in machine learning. Springer, pp 277–306

  • Stefanowski J (2016) Dealing with data difficulty factors while learning from imbalanced data. In: Challenges in computational statistics and data mining. Springer, pp 333–363

  • Stefanowski J, Wilk S (2008) Selective pre-processing of imbalanced data for improving classification performance. In: International conference on data warehousing and knowledge discovery. Springer, pp 283–292

  • Tang Y, Gao J (2007) Improved classification for problem involving overlapping patterns. IEICE Trans Inf Syst 90(11):1787–1795

    Article  Google Scholar 

  • Tang W, Mao K, Mak LO, Ng GW (2010) Classification for overlapping classes using optimized overlapping region detection and soft decision. In: 2010 13th international conference on information fusion. IEEE, pp 1–8

  • Thornton C (1998) Separability is a learner’s best friend. In: 4th Neural computation and psychology workshop, London, 9–11 April 1997. Springer, pp 40–46

  • Tomek I (1976) Two modifications of cnn. IEEE Trans Syst Man Commun 6:769–772

    MathSciNet  MATH  Google Scholar 

  • Vorraboot P, Rasmequan S, Chinnasarn K, Lursinsap C (2015) Improving classification rate constrained to imbalanced data between overlapped and non-overlapped regions by hybrid algorithms. Neurocomputing 152:429–443

    Article  Google Scholar 

  • Vuttipittayamongkol P, Elyan E (2020a) Improved overlap-based undersampling for imbalanced dataset classification with application to epilepsy and Parkinson’s disease. Int J Neural Syst 30(08):2050043

  • Vuttipittayamongkol P, Elyan E (2020b) Neighbourhood-based undersampling approach for handling imbalanced and overlapped data. Inf Sci 509:47–70.

    Article  Google Scholar 

  • Vuttipittayamongkol P, Elyan E, Petrovski A, Jayne C (2018) Overlap-based undersampling for improving imbalanced data classification. In: International conference on intelligent data engineering and automated learning. Springer, pp 689–697

  • Vuttipittayamongkol P, Elyan E, Petrovski A (2020) On the class overlap problem in imbalanced data classification. Knowl Based Syst 106631

  • Van der Walt CM, Barnard E (2007) Measures for the characterisation of pattern-recognition data sets. In: 18th Annual symposium of the pattern recognition association of South Africa

  • Van der Walt CM, et al. (2008) Data measures that characterise classification problems. PhD thesis, University of Pretoria

  • Wang BX, Japkowicz N (2010) Boosting support vector machines for imbalanced data sets. Knowl Inf Syst 25(1):1–20

    Article  Google Scholar 

  • Wang S, Yao X (2009) Diversity analysis on imbalanced data sets by using ensemble models. In: 2009 IEEE symposium on computational intelligence and data mining. IEEE, pp 324–331

  • Wang S, Yao X (2013) Using class imbalance learning for software defect prediction. IEEE Trans Reliab 62(2):434–443

    Article  Google Scholar 

  • Wei J, Huang H, Yao L, Hu Y, Fan Q, Huang D (2020a) Ia-suwo: an improving adaptive semi-unsupervised weighted oversampling for imbalanced classification problems. Knowl Based Syst 203:106116

    Article  Google Scholar 

  • Wei J, Huang H, Yao L, Hu Y, Fan Q, Huang D (2020b) Ni-mwmote: an improving noise-immunity majority weighted minority oversampling technique for imbalanced classification problems. Expert Syst Appl 158:113504

    Article  Google Scholar 

  • Weng CG, Poon J (2006) A data complexity analysis on imbalanced datasets and an alternative imbalance recovering strategy. In: 2006 IEEE/WIC/ACM international conference on web intelligence (WI 2006 main conference proceedings) (WI’06). IEEE, pp 270–276

  • Wilson DL (1972) Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans Syst Man Cybern 3:408–421

    Article  MathSciNet  MATH  Google Scholar 

  • Wojciechowski S, Wilk S (2017) Difficulty factors and preprocessing in imbalanced data sets: an experimental study on artificial data. Found Comput Decis Sci 42(2):149–176

    Article  MATH  Google Scholar 

  • Wozniak M, Grana M, Corchado E (2014) A survey of multiple classifier systems as hybrid systems. Inf Fusion 16:3–17

    Article  Google Scholar 

  • Xiong H, Wu J, Liu L (2010) classification with classoverlapping: a systematic study. In: Proceedings of the 1st international conference on E-Business intelligence (ICEBI2010). Atlantis Press

  • Yan Y, Liu R, Ding Z, Du X, Chen J, Zhang Y (2019) A parameter-free cleaning method for smote in imbalanced classification. IEEE Access 7:23537–23548

    Article  Google Scholar 

  • Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Syst Appl 36(3):5718–5727

    Article  Google Scholar 

  • Zhu C, Wang Z (2017) Entropy-based matrix learning machine for imbalanced data sets. Pattern Recogn Lett 88:72–80

    Article  Google Scholar 

  • Zhu T, Lin Y, Liu Y (2017) Synthetic minority oversampling technique for multiclass imbalance problems. Pattern Recogn 72:327–340

    Article  Google Scholar 

  • Zhu T, Lin Y, Liu Y (2020a) Improving interpolation-based oversampling for imbalanced data learning. Knowl-Based Syst 187:104826

    Article  Google Scholar 

  • Zhu Y, Yan Y, Zhang Y, Zhang Y (2020b) Ehso: evolutionary hybrid sampling in overlapping scenarios for imbalanced learning. Neurocomputing 417:333–346

    Article  Google Scholar 

Download references

Acknowledgements

This work is funded by national funds through the FCT-Foundation for Science and Technology, I.P., within the scope of the project CISUC-UID/CEC/00326/2020 and by European Social Fund, through the Regional Operational Program Centro 2020. This work is also partially supported by Andalusian frontier regional project A-TIC-434-UGR20 and by the Spanish Ministry of Science and Technology under project PID2020-119478GB-I00 including European Regional Development Funds. This work was also partially funded by the project Safe Cities-Inovação para Construir Cidades Seguras, with the reference POCI-01-0247-FEDER-041435, co-funded by the European Regional Development Fund (ERDF), through the Operational Programme for Competitiveness and Internationalization (COMPETE 2020), under the PORTUGAL 2020 Partnership Agreement. The work is further supported by the FCT Research Grant SFRH/BD/138749/2018.

Author information

Authors and Affiliations

Authors

Contributions

MSS Conceptualisation, Methodology, Literature Search, Investigation, Formal Analysis, Writing—Original Draft, Writing—Review and Editing, Visualisation. PHA Conceptualisation, Validation, Writing—Review and Editing, Supervision. NJ Validation, Writing—Review and Editing. AF Validation, Writing—Review and Editing. CS Validation, Writing—Review and Editing. SW Validation, Writing—Review and Editing. JS Writing—Review and Editing.

Corresponding author

Correspondence to Miriam Seoane Santos.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Code availability

https://github.com/miriamspsantos/pycol.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Santos, M.S., Abreu, P.H., Japkowicz, N. et al. On the joint-effect of class imbalance and overlap: a critical review. Artif Intell Rev 55, 6207–6275 (2022). https://doi.org/10.1007/s10462-022-10150-3

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10462-022-10150-3

Keywords

Navigation