
An Ensemble Tree Classifier for Highly Imbalanced Data Classification

Published in: Journal of Systems Science and Complexity

Abstract

The performance of traditional imbalanced classification algorithms degrades when the data are highly imbalanced, and handling such data remains a difficult problem. In this paper, the authors propose an ensemble tree classifier for highly imbalanced data classification. The classifier is organized as a complete binary tree, and a mathematical model is established that relates the structural features of the classifier to its classification performance; the authors prove that the model parameters of the ensemble classifier can be obtained by direct calculation. First, AdaBoost is used as the base classifier to construct the tree-structured model. Next, the classification cost of the model is computed, yielding a quantitative mathematical description of the relationship between the cost and the features of the ensemble tree classifier. Finally, minimizing this cost is formulated as an optimization problem, and the parameters of the ensemble tree classifier are derived theoretically. The approach is evaluated on several highly imbalanced datasets from different fields, with the AUC (area under the ROC curve) and the F-measure as evaluation criteria. Compared with traditional imbalanced classification algorithms, the ensemble tree classifier achieves better classification performance.
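
The abstract only outlines the method, so the snippet below is a minimal illustrative sketch rather than the authors' algorithm: it cascades two scikit-learn AdaBoostClassifier nodes (a stand-in for one root-to-leaf path of the complete binary tree) on a synthetic 99:1 dataset and reports the AUC and F-measure used as evaluation criteria. The routing rule, the 90th-percentile cut, and all variable names are assumptions made purely for illustration.

    # Illustrative sketch only -- NOT the authors' ensemble tree classifier.
    # A two-node AdaBoost cascade (one root-to-leaf path of a binary tree) on a
    # synthetic 99:1 dataset, evaluated with AUC and F-measure.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.metrics import roc_auc_score, f1_score
    from sklearn.model_selection import train_test_split

    # Synthetic data with roughly a 99:1 class ratio (class 1 is the minority).
    X, y = make_classification(n_samples=20000, n_features=20,
                               weights=[0.99, 0.01], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    # Root node: AdaBoost trained on the full (imbalanced) training set.
    root = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

    # Child node: AdaBoost trained only on the samples the root scores highest,
    # giving it a far less imbalanced training subset (assumed 90th-percentile cut).
    scores_tr = root.predict_proba(X_tr)[:, 1]
    threshold = np.quantile(scores_tr, 0.90)
    mask = scores_tr >= threshold
    child = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_tr[mask], y_tr[mask])

    # Prediction: the root filters, the child re-scores the filtered samples.
    scores_te = root.predict_proba(X_te)[:, 1]
    route = scores_te >= threshold
    final = scores_te.copy()
    final[route] = child.predict_proba(X_te[route])[:, 1]
    pred = (final >= 0.5).astype(int)

    print("AUC      :", roc_auc_score(y_te, final))
    print("F-measure:", f1_score(y_te, pred))

In the paper itself, the tree is a complete binary tree whose node parameters come from the cost-minimization derivation; the fixed quantile cut above is only a placeholder for that optimization step.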



Author information

Corresponding author

Correspondence to Zhong Wang.

Additional information

This work was supported by the National Natural Science Foundation of China under Grant No. 61976198, the Natural Science Research Key Project for Colleges and Universities of Anhui Province under Grant No. KJ2019A0726, and the High-level Scientific Research Foundation for the Introduction of Talent of Hefei Normal University under Grant No. 2020RCJJ44.

This paper was recommended for publication by Editor ZHANG Xinyu.

About this article

Cite this article

Shi, P., Wang, Z. An Ensemble Tree Classifier for Highly Imbalanced Data Classification. J Syst Sci Complex 34, 2250–2266 (2021). https://doi.org/10.1007/s11424-021-1038-8

