
RFCL: A new under-sampling method of reducing the degree of imbalance and overlap

  • Theoretical Advances
  • Published in: Pattern Analysis and Applications

Abstract

Imbalanced data arise in many domains, such as medical science, the Internet, finance, and surveillance. Learning from imbalanced data, also known as the imbalanced learning problem, remains a significant challenge and deserves more attention. In this paper, we focus on class overlap, one of the most important inherent factors that hinder learning from imbalanced data. We put forward the overlapping degree (OD) and group data sets into two types: high OD (HOD) and low OD (LOD). Our experiments show that LOD data sets can achieve good results without any under-sampling algorithm, even though some of them have a high degree of imbalance, and that under-sampling does not improve their results much. We propose a new under-sampling algorithm, the random forest cleaning rule (RFCL), which removes majority-class instances that cross a new classification boundary defined by a margin threshold, thereby reducing both the degree of overlap and the degree of imbalance. The threshold is found by maximizing the F1-score of the final classifier. Experimental results show that RFCL outperforms seven classic and two recent under-sampling methods in terms of F1-score and area under the curve, whether the final classifier is a random forest or a support vector machine.
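The cleaning rule outlined in the abstract can be sketched as follows. This is a minimal interpretation, not the authors' implementation: the exact margin definition, the threshold grid, and the evaluation protocol (a held-out split here, rather than whatever the paper uses) are assumptions, and the helper names `binary_margins` and `rfcl_undersample` are hypothetical. The margin of an instance is taken as the forest's vote share for its true class minus that for the other class; majority-class instances with margins below a candidate threshold are removed, and the threshold that maximizes the F1-score of a retrained classifier is kept.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def binary_margins(forest, X, y):
    """Voting margin for binary labels: P(true class) - P(other class)."""
    proba = forest.predict_proba(X)            # columns follow forest.classes_
    idx = np.searchsorted(forest.classes_, y)  # column index of each true label
    p_true = proba[np.arange(len(y)), idx]
    return 2.0 * p_true - 1.0

def rfcl_undersample(X, y, minority_label=1,
                     thresholds=np.linspace(-1.0, 0.9, 20), seed=0):
    """Drop majority instances whose forest margin falls below the threshold
    that maximizes the F1-score of a classifier retrained on the cleaned set."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    forest = RandomForestClassifier(n_estimators=100, random_state=seed)
    forest.fit(X_tr, y_tr)
    margins = binary_margins(forest, X_tr, y_tr)
    is_major = y_tr != minority_label

    best_f1, best_keep = -1.0, None
    for t in thresholds:
        keep = ~is_major | (margins >= t)      # always keep minority instances
        if len(np.unique(y_tr[keep])) < 2:
            continue                           # cleaning removed a whole class
        clf = RandomForestClassifier(n_estimators=100, random_state=seed)
        clf.fit(X_tr[keep], y_tr[keep])
        score = f1_score(y_val, clf.predict(X_val), pos_label=minority_label)
        if score > best_f1:
            best_f1, best_keep = score, keep
    return X_tr[best_keep], y_tr[best_keep], best_f1
```

Because the grid includes a threshold of −1, at which nothing is removed, the search can only match or improve on the uncleaned baseline's validation F1-score.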



Acknowledgements

We thank the UCI repository for providing the data sets used in this work. We also thank the authors of the R language and of the packages we used.

Author information


Corresponding author

Correspondence to Zuoquan Zhang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhang, R., Zhang, Z. & Wang, D. RFCL: A new under-sampling method of reducing the degree of imbalance and overlap. Pattern Anal Applic 24, 641–654 (2021). https://doi.org/10.1007/s10044-020-00929-x
