
An imbalanced ensemble learning method based on dual clustering and stage-wise hybrid sampling


Abstract

Imbalanced data classification remains a research hotspot and a challenging problem in machine learning. The difficulty of imbalanced learning lies not only in class imbalance but also in the more complex problem of class overlapping, yet most existing algorithms focus mainly on the former, which limits how far they can improve. To address this limitation, this paper proposes an ensemble algorithm based on dual clustering and stage-wise hybrid sampling (DCSHS) that tackles both class imbalance and class overlapping. DCSHS has three main components: a projection clustering combination framework (PCC), stage-wise hybrid sampling (SHS), and an envelope clustering transfer mapping mechanism (CTM). PCC creates multiple subsets through projective clustering; SHS identifies the overlapping region of each subset and performs hybrid sampling; CTM combines clustering and transfer learning to extract more information from the samples in each subset. First, we design the PCC framework guided by the Davies-Bouldin clustering effectiveness index (DBI), which yields high-quality clusters and combines them into a set of cross-complete subsets (CCS) with low overlapping. Second, according to the class characteristics of each subset, the SHS algorithm de-overlaps and balances the subsets. Finally, the CTM is constructed for all processed subsets by means of transfer learning, further reducing class overlapping and exploiting the structural information of the samples. Weak classifiers are trained on the balanced subsets and fused, as in other imbalanced ensemble algorithms. The major advantage of our algorithm is that it exploits the intersectionality of the CCS to softly eliminate overlapping majority samples and to learn as much information as possible from the overlapping samples, thereby mitigating class overlapping while balancing the classes. In the experiments, more than 30 public datasets and over ten representative algorithms are used for verification. The results show that DCSHS performs significantly best in terms of anti-overlapping ability, Recall, F1-M, G-M, AUC, and diversity.
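To make the overall pipeline concrete, the following is a minimal, hypothetical Python sketch of a DCSHS-style workflow, not the authors' implementation: the projective clustering combination (PCC), stage-wise hybrid sampling (SHS), and transfer mapping (CTM) stages are replaced by simple stand-ins (DBI-guided KMeans, SMOTEENN from imbalanced-learn, and no mapping step), and all function names are illustrative.

```python
# Hypothetical DCSHS-style sketch: DBI-guided clustering -> per-subset hybrid
# sampling -> weak learners fused by averaging. This is a simplified stand-in,
# not the DCSHS algorithm itself.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score
from sklearn.tree import DecisionTreeClassifier
from imblearn.combine import SMOTEENN


def dbi_guided_kmeans(X, k_range=range(2, 8), random_state=0):
    """Pick the KMeans model whose labels minimise the Davies-Bouldin index."""
    best_model, best_dbi = None, np.inf
    for k in k_range:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        dbi = davies_bouldin_score(X, model.labels_)
        if dbi < best_dbi:
            best_model, best_dbi = model, dbi
    return best_model


def fit_subset_ensemble(X, y, random_state=0):
    """Train one weak learner per cluster-derived, hybrid-sampled subset."""
    clustering = dbi_guided_kmeans(X, random_state=random_state)
    classifiers = []
    for c in np.unique(clustering.labels_):
        mask = clustering.labels_ == c
        X_c, y_c = X[mask], y[mask]
        if len(np.unique(y_c)) < 2:   # skip single-class subsets
            continue
        # Hybrid sampling stand-in; tiny clusters may need a smaller k_neighbors.
        X_bal, y_bal = SMOTEENN(random_state=random_state).fit_resample(X_c, y_c)
        clf = DecisionTreeClassifier(max_depth=3, random_state=random_state).fit(X_bal, y_bal)
        if len(clf.classes_) == 2:    # keep only learners that saw both classes
            classifiers.append(clf)
    return classifiers


def predict_fused(classifiers, X):
    """Fuse the weak learners by averaging their class-1 probabilities."""
    probs = np.mean([clf.predict_proba(X)[:, 1] for clf in classifiers], axis=0)
    return (probs >= 0.5).astype(int)
```

A caller would pass a feature matrix X and binary labels y (minority class encoded as 1) to fit_subset_ensemble and then fuse predictions with predict_fused; the actual DCSHS additionally exploits the intersections between the cross-complete subsets and the envelope clustering transfer mapping described above.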




Data availability

The data and code are available at https://pan.baidu.com/s/1M0N39gEIc4bK2qwg9EYTMQ (extraction code: 1111).


Acknowledgements

We are grateful for the support of the National Natural Science Foundation of China (NSFC) (Nos. U21A20448 and 61771080); the Natural Science Foundation of Chongqing (cstc2020jcyj-msxmX0100, cstc2020jscx-gksb0010, cstc2020jscx-msxm0369); the Basic and Advanced Research Project in Chongqing (cstc2020jscx-fyzx0212, cstc2020jscx-msxm0369, cstc2020jcyj-msxmX0523); the Chongqing Social Science Planning Project (2018YBYY133); and the Special Project of Improving Scientific and Technological Innovation Ability of the Army Medical University (2019XLC3055).

Author information


Corresponding authors

Correspondence to Pin Wang or Yongming Li.

Ethics declarations

Consent for publication

Not applicable.

Conflict of interest

None.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

See Table 7.

Table 7 Classification performance of different algorithms using SVM on datasets with low and high IR. Superscripts m1–m4 denote oversampling, undersampling, hybrid-sampling, and ensemble-learning methods, respectively (see Section 4.1.1)
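As a point of reference for how such a table can be produced, the sketch below shows one plausible way, not the paper's evaluation code, to compute the reported metrics (Recall, F1-M, G-M, AUC) for an SVM base classifier; it assumes binary labels with the minority class encoded as 1, and the function name is illustrative.

```python
# Hypothetical evaluation sketch: hold-out split, SVM, and the four metrics
# reported in Table 7. Dataset loading is assumed to happen elsewhere.
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score, f1_score, roc_auc_score
from imblearn.metrics import geometric_mean_score


def evaluate_svm(X, y, random_state=0):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=random_state)
    clf = SVC(probability=True, random_state=random_state).fit(X_tr, y_tr)
    y_pred = clf.predict(X_te)
    y_score = clf.predict_proba(X_te)[:, 1]
    return {
        "Recall": recall_score(y_te, y_pred),           # minority class assumed to be label 1
        "F1-M": f1_score(y_te, y_pred, average="macro"),
        "G-M": geometric_mean_score(y_te, y_pred),
        "AUC": roc_auc_score(y_te, y_score),
    }
```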

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Li, F., Wang, B., Wang, P. et al. An imbalanced ensemble learning method based on dual clustering and stage-wise hybrid sampling. Appl Intell 53, 21167–21191 (2023). https://doi.org/10.1007/s10489-023-04650-0

