
Dynamic self-paced sampling ensemble for highly imbalanced and class-overlapped data classification

Published in: Data Mining and Knowledge Discovery

Abstract

Datasets with imbalanced class distributions arise in a wide range of real-world applications. A great number of approaches have been proposed to address the class-imbalance challenge, but most of these models perform poorly when datasets exhibit high class imbalance, class overlap, and low data quality. In this study, we propose an effective meta-framework for highly imbalanced, overlapped classification, called DAPS (DynAmic self-Paced sampling enSemble), which (1) leverages reasonable and effective sampling to maximize the utilization of informative instances and avoid serious information loss, and (2) assigns proper instance weights to address the issue of noisy data. Furthermore, most existing canonical classifiers (e.g., Decision Tree, Random Forest) can be integrated into DAPS. Comprehensive experimental results on both synthetic and three real-world datasets show that DAPS obtains considerable improvements in F1-score compared to a broad range of published models.
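The abstract's two ingredients, informed undersampling of the majority class and instance weighting that shifts toward harder examples over time, can be sketched roughly as follows. This is a minimal illustration only, not the authors' exact algorithm: the function names, the hardness measure (the running ensemble's predicted minority probability), and the linear self-paced weighting schedule are all assumptions made for the sketch.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def self_paced_sample_ensemble(X, y, n_estimators=10, k_bins=5, seed=0):
    """Toy self-paced undersampling ensemble (y: 1 = minority, 0 = majority)."""
    rng = np.random.default_rng(seed)
    maj_idx = np.flatnonzero(y == 0)
    min_idx = np.flatnonzero(y == 1)
    n_min = len(min_idx)
    clfs = []
    for i in range(n_estimators):
        if not clfs:
            # First round: plain random undersampling of the majority class.
            pick = rng.choice(maj_idx, size=n_min, replace=False)
        else:
            # Hardness proxy: current ensemble's minority probability for
            # majority instances (higher = harder / more overlapped).
            prob = np.mean(
                [c.predict_proba(X[maj_idx])[:, 1] for c in clfs], axis=0)
            bins = np.minimum((prob * k_bins).astype(int), k_bins - 1)
            # Self-paced schedule: early rounds sample bins uniformly,
            # later rounds weight harder bins more heavily.
            alpha = i / max(n_estimators - 1, 1)
            bin_w = (1 - alpha) * np.ones(k_bins) + alpha * np.arange(1, k_bins + 1)
            w = bin_w[bins]
            w = w / w.sum()
            pick = rng.choice(maj_idx, size=n_min, replace=False, p=w)
        sub = np.concatenate([pick, min_idx])
        clf = DecisionTreeClassifier(max_depth=4, random_state=seed + i)
        clf.fit(X[sub], y[sub])
        clfs.append(clf)
    return clfs

def ensemble_predict(clfs, X, threshold=0.5):
    """Average the members' minority probabilities, threshold at 0.5."""
    prob = np.mean([c.predict_proba(X)[:, 1] for c in clfs], axis=0)
    return (prob >= threshold).astype(int)
```

Each round trains on a balanced subset (all minority instances plus an equal-sized majority sample), so no synthetic instances are generated and each majority instance keeps a chance of being seen, which is the sampling idea the abstract describes.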



Notes

  1. The code is available at https://github.com/ZhouF-ECNU/DAPS.

  2. Due to space limitations, we only report precision and recall results on the real-world datasets. AUPRC (the area under the precision-recall curve) does not properly reflect our model's performance, as DAPS uses a fixed threshold of 0.5 to produce predictions.

  3. https://ai.ppdai.com/.


Author information

Corresponding author

Correspondence to Fang Zhou.

Additional information

Responsible editor: Albrecht Zimmermann and Peggy Cellier.


This research was supported in part by NSFC Grant 61902127 and Natural Science Foundation of Shanghai 19ZR1415700.


Cite this article

Zhou, F., Gao, S., Ni, L. et al. Dynamic self-paced sampling ensemble for highly imbalanced and class-overlapped data classification. Data Min Knowl Disc 36, 1601–1622 (2022). https://doi.org/10.1007/s10618-022-00838-z

