
Ensemble learning via constraint projection and undersampling technique for class-imbalance problem

Soft Computing

Abstract

Ensemble learning is an effective technique for the class-imbalance problem, and the key to obtaining a successful ensemble is to create individual base classifiers with high accuracy and diversity. In this paper, we propose a novel ensemble learning method based on constraint projection and an undersampling technique, which constructs each base classifier in two steps: (1) build a set of pairwise constraints by undersampling examples from the minority/majority class sets, and learn a projection matrix from this constraint set; and (2) undersample the original training set to obtain a new training set, on which a base classifier is trained in the new feature space defined by the projection matrix. In the first step, the projection matrix enhances the separability between examples of different classes and thus improves the performance of the base classifier, while undersampling creates diverse sets of pairwise constraints and hence diverse projection matrices, introducing diversity among the base classifiers. In the second step, undersampling improves the performance of the base classifiers on the minority class and further increases the diversity between them. Experiments on 29 datasets with various data distributions and imbalance ratios show that the proposed method performs significantly better than other state-of-the-art methods on recall, g-mean, f-measure and AUC.
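The two-step construction described in the abstract can be sketched in code. In this minimal numpy-only sketch, the constraint projection is approximated by the top eigenvectors of the difference between cannot-link and must-link scatter matrices, and a nearest-centroid rule stands in for the base classifier; both are illustrative stand-ins under stated assumptions, not the paper's exact formulation, and the class name `ConsEnsemble` is hypothetical.

```python
import numpy as np

def undersample(idx_min, idx_maj, rng):
    """Randomly undersample the majority class down to the minority size."""
    picked = rng.choice(idx_maj, size=len(idx_min), replace=False)
    return np.concatenate([idx_min, picked])

def learn_projection(X, must, cannot, dim):
    """Projection from pairwise constraints: top eigenvectors of the
    difference between cannot-link and must-link scatter matrices
    (a simple stand-in for the paper's constraint projection)."""
    def scatter(pairs):
        S = np.zeros((X.shape[1], X.shape[1]))
        for i, j in pairs:
            d = (X[i] - X[j])[:, None]
            S += d @ d.T
        return S
    _, vecs = np.linalg.eigh(scatter(cannot) - scatter(must))
    return vecs[:, -dim:]  # directions along which the classes separate

class ConsEnsemble:
    """Hypothetical sketch: each member is built via (1) constraints from an
    undersampled subset -> projection matrix, (2) a second undersampled
    subset -> base classifier trained in the projected space."""

    def __init__(self, n_estimators=5, dim=1, seed=0):
        self.n_estimators, self.dim = n_estimators, dim
        self.rng = np.random.default_rng(seed)
        self.members = []  # (projection, minority centroid, majority centroid)

    def fit(self, X, y):
        idx_min = np.flatnonzero(y == 1)
        idx_maj = np.flatnonzero(y == 0)
        for _ in range(self.n_estimators):
            # Step 1: pairwise constraints from an undersampled subset.
            sub = undersample(idx_min, idx_maj, self.rng)
            must = [(i, j) for i in sub for j in sub if i < j and y[i] == y[j]]
            cannot = [(i, j) for i in sub for j in sub if i < j and y[i] != y[j]]
            P = learn_projection(X, must, cannot, self.dim)
            # Step 2: undersample again; train the base classifier
            # (here: nearest centroid) in the projected space.
            sub2 = undersample(idx_min, idx_maj, self.rng)
            Z, yz = X[sub2] @ P, y[sub2]
            self.members.append((P, Z[yz == 1].mean(0), Z[yz == 0].mean(0)))
        return self

    def predict(self, X):
        votes = np.zeros(len(X))
        for P, c1, c0 in self.members:
            Z = X @ P
            votes += (np.linalg.norm(Z - c1, axis=1)
                      < np.linalg.norm(Z - c0, axis=1))
        return (votes * 2 > len(self.members)).astype(int)  # majority vote
```

Because both the constraint set and the training subset are re-sampled per member, each projection matrix and base classifier differs, which is the source of ensemble diversity the abstract describes.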




Acknowledgements

This work is supported in part by the National Natural Science Foundation of China (No. 61802329), in part by Project of Science and Technology Department of Henan Province (No. 182102210132), in part by the Innovation Team Support Plan of the University of Science and Technology of Henan Province (No. 19IRTSTHN014), and in part by Nanhu Scholars Program for Young Scholars of XYNU.

Author information

Correspondence to Huaping Guo.

Ethics declarations

Conflict of interest

The authors, Huaping Guo, Jun Zhou and Chang-an Wu, declare that they have no conflict of interest.

Additional information

Communicated by L. Wang.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Cite this article

Guo, H., Zhou, J. & Wu, Ca. Ensemble learning via constraint projection and undersampling technique for class-imbalance problem. Soft Comput 24, 4711–4727 (2020). https://doi.org/10.1007/s00500-019-04501-6

