On the joint-effect of class imbalance and overlap: a critical review

Santos, Miriam Seoane; Abreu, Pedro Henriques; Japkowicz, Nathalie; Fernández, Alberto; Soares, Carlos; Wilk, Szymon; Santos, João

doi:10.1007/s10462-022-10150-3

Published: 24 March 2022

Volume 55, pages 6207–6275, (2022)

Artificial Intelligence Review

Miriam Seoane Santos¹ (ORCID: orcid.org/0000-0002-5912-963X),
Pedro Henriques Abreu¹,
Nathalie Japkowicz²,
Alberto Fernández³,
Carlos Soares⁴,
Szymon Wilk⁵ &
João Santos⁶,⁷

Abstract

Current research on imbalanced data recognises that class imbalance is aggravated by other data intrinsic characteristics, among which class overlap stands out as one of the most harmful. The combination of these two problems creates a new and difficult scenario for classification tasks and has been discussed in several research works over the past two decades. In this paper, we argue that although some insightful information can be derived from related research, the joint-effect of class overlap and imbalance is still not fully understood, and we advocate for the need to move towards a unified view of the class overlap problem in imbalanced domains. To that end, we start by performing a thorough analysis of existing literature on the joint-effect of class imbalance and overlap, elaborating on important details left undiscussed in the original papers, namely the impact of data domains with different characteristics and the behaviour of classifiers with distinct learning biases. This leads to the hypothesis that class overlap comprises multiple representations, which are important to accurately measure and analyse in order to provide a full characterisation of the problem. Accordingly, we devise two novel taxonomies, one for class overlap measures and the other for class overlap-based approaches, both resonating with the distinct representations of class overlap identified. This paper therefore presents a global and unique view on the joint-effect of class imbalance and overlap, from precursor work to recent developments in the field. It meticulously discusses some concepts taken as implicit in previous research, explores new perspectives in light of the limitations found, and presents new ideas that will hopefully inspire researchers to move towards a unified view on the problem and the development of suitable strategies for imbalanced and overlapped domains.

Notes

The reader may find supporting information in the supplementary material online at https://student.dei.uc.pt/~miriams/pdf-files/AIR_2021_Appendix.pdf.
The interested reader may find detailed information on the performance of each classifier in the supplementary material provided online at https://student.dei.uc.pt/~miriams/pdf-files/AIR_2021_Appendix.pdf.
https://github.com/miriamspsantos/pycol.
https://archive.ics.uci.edu.
https://www.kaggle.com.
http://keel.es.
https://www.openml.org.
https://github.com/miriamspsantos/datagenerator.
https://github.com/miriamspsantos/open-source-imbalance-overlap.
https://github.com/miriamspsantos/pycol.

Acknowledgements

This work is funded by national funds through the FCT-Foundation for Science and Technology, I.P., within the scope of the project CISUC-UID/CEC/00326/2020 and by European Social Fund, through the Regional Operational Program Centro 2020. This work is also partially supported by Andalusian frontier regional project A-TIC-434-UGR20 and by the Spanish Ministry of Science and Technology under project PID2020-119478GB-I00 including European Regional Development Funds. This work was also partially funded by the project Safe Cities-Inovação para Construir Cidades Seguras, with the reference POCI-01-0247-FEDER-041435, co-funded by the European Regional Development Fund (ERDF), through the Operational Programme for Competitiveness and Internationalization (COMPETE 2020), under the PORTUGAL 2020 Partnership Agreement. The work is further supported by the FCT Research Grant SFRH/BD/138749/2018.

Author information

Authors and Affiliations

Department of Informatics Engineering, Centre for Informatics and Systems of the University of Coimbra, University of Coimbra, Coimbra, Portugal
Miriam Seoane Santos & Pedro Henriques Abreu
Department of Computer Science, American University, Washington, DC, 20016, USA
Nathalie Japkowicz
Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
Alberto Fernández
Fraunhofer Portugal AICOS and LIACC, Faculdade de Engenharia, Universidade do Porto, Porto, Portugal
Carlos Soares
Institute of Computing Science, Poznan University of Technology, Poznan, Poland
Szymon Wilk
IPO-Porto Research Centre (CI-IPOP), Porto, Portugal
João Santos
Instituto de Ciências Biomédicas Abel Salazar da Universidade do Porto, Porto, Portugal
João Santos

Authors

Miriam Seoane Santos
Pedro Henriques Abreu
Nathalie Japkowicz
Alberto Fernández
Carlos Soares
Szymon Wilk
João Santos

Contributions

MSS: Conceptualisation, Methodology, Literature Search, Investigation, Formal Analysis, Writing—Original Draft, Writing—Review and Editing, Visualisation. PHA: Conceptualisation, Validation, Writing—Review and Editing, Supervision. NJ: Validation, Writing—Review and Editing. AF: Validation, Writing—Review and Editing. CS: Validation, Writing—Review and Editing. SW: Validation, Writing—Review and Editing. JS: Writing—Review and Editing.

Corresponding author

Correspondence to Miriam Seoane Santos.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Code availability

https://github.com/miriamspsantos/pycol.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Santos, M.S., Abreu, P.H., Japkowicz, N. et al. On the joint-effect of class imbalance and overlap: a critical review. Artif Intell Rev 55, 6207–6275 (2022). https://doi.org/10.1007/s10462-022-10150-3

Published: 24 March 2022
Issue Date: December 2022
DOI: https://doi.org/10.1007/s10462-022-10150-3
