
Fuzzy entropy and fuzzy support-based boosting random forests for imbalanced data

Applied Intelligence
Abstract

Datasets with skewed class distributions pose difficulties for pattern classification algorithms. Most undersampling methods consider only the imbalance ratio and rarely account for the distribution of the original dataset. Moreover, many algorithms separate the resampling of imbalanced data from classifier training, which can discard important information and degrade classifier performance. To address these problems, this paper proposes a boosting random forest based on fuzzy entropy and fuzzy support (FESBoost). The algorithm consists of two main parts: static undersampling and ensemble classifier training. First, an attenuation function and the shared k-nearest neighbor algorithm are used to construct a global class entropy, based on which the region occupied by the majority class is divided into a safe area and a boundary area. Second, density peak clustering (DPCA) selects representative samples from the safe area; this constitutes the static undersampling step. Finally, the classifier is trained within a boosting framework. Because the dataset remains imbalanced after static undersampling, the majority class is undersampled again before each iteration, based on the global class entropy and the average class support; the number of samples drawn depends on the iteration count and the imbalance ratio. FESBoost thus combines static and dynamic resampling: static resampling reduces the imbalance ratio, the overlap between classes, and the classifier training cost, while dynamic resampling updates the majority samples according to the data distribution and each sample's likelihood of misclassification. The superiority of the proposed algorithm is verified experimentally on 9 synthetic datasets and 34 KEEL datasets. Compared with seven existing algorithms, FESBoost achieves better generalization performance.
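Since the abstract's description of the pipeline is dense, a minimal Python sketch may help fix ideas. Everything below is an illustrative assumption rather than the paper's exact method: the function names (class_entropy, dpc_representatives, fesboost) are hypothetical, plain k-NN stands in for the shared-nearest-neighbor scheme, exp(-d) for the attenuation function, a bare-bones density-peak selection for DPCA, and a shrinking sampling schedule with a running-average margin stands in for the fuzzy-support-guided dynamic undersampling. Labels are assumed binary, with y == 1 the minority class.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier


def class_entropy(X, y, k=7):
    # Per-sample class entropy over a k-NN neighborhood, with neighbor votes
    # weighted by an exponential attenuation of distance (assumed form).
    dist, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    dist, idx = dist[:, 1:], idx[:, 1:]            # drop each point itself
    w = np.exp(-dist)                              # attenuation function
    ent = np.zeros(len(X))
    for i in range(len(X)):
        p = w[i][y[idx[i]] == 1].sum() / w[i].sum()    # weighted minority share
        for q in (p, 1.0 - p):
            if q > 0:
                ent[i] -= q * np.log2(q)
    return ent                                     # 0 = pure neighborhood, 1 = 50/50


def dpc_representatives(X, n_keep):
    # Tiny density-peak clustering: score each point by rho * delta (local
    # density times distance to the nearest denser point) and keep the
    # n_keep highest-scoring points as representatives.
    d = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    dc = np.percentile(d[d > 0], 5)                # cutoff-distance heuristic
    rho = (d < dc).sum(axis=1).astype(float)
    delta = np.empty(len(X))
    for i in range(len(X)):
        denser = rho > rho[i]
        delta[i] = d[i][denser].min() if denser.any() else d[i].max()
    return np.argsort(rho * delta)[::-1][:n_keep]


def fesboost(X, y, T=10, k=7, ent_threshold=0.5):
    # Static undersampling, then T boosting rounds with dynamic undersampling.
    maj, mnr = np.where(y == 0)[0], np.where(y == 1)[0]
    ent = class_entropy(X, y, k)
    safe = maj[ent[maj] < ent_threshold]           # low entropy -> safe area
    border = maj[ent[maj] >= ent_threshold]        # high entropy -> boundary area
    # Static step: keep only DPC representatives of the safe area.
    reps = safe[dpc_representatives(X[safe], max(len(mnr), len(safe) // 2))]
    pool = np.concatenate([reps, border])
    models, support = [], np.zeros(len(X))         # support = running avg margin
    for t in range(1, T + 1):
        # Assumed schedule: the kept majority subset shrinks toward the
        # minority size as the iterations progress.
        n_t = min(len(pool), len(mnr) + int(len(pool) * (1 - t / T)))
        # Prefer hard majority samples: low average support, high entropy.
        sel = pool[np.argsort(support[pool] - ent[pool])[:n_t]]
        idx = np.concatenate([sel, mnr])
        clf = DecisionTreeClassifier(max_depth=4).fit(X[idx], y[idx])
        models.append(clf)
        margin = np.where(clf.predict(X) == y, 1.0, -1.0)
        support = ((t - 1) * support + margin) / t
    return models
```

Prediction would then pool the trees' votes, e.g. np.mean([m.predict(X_test) for m in models], axis=0) > 0.5. The entropy threshold, attenuation form, and per-round sampling schedule are placeholders; the paper defines fuzzy entropy and fuzzy support precisely, and this sketch only mirrors the overall control flow of static undersampling followed by boosting with per-round resampling.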

Acknowledgements

This work was supported by the National Natural Science Foundation of China (grant 61573266) and the University Natural Science Research Key Projects of Anhui Province (KJ2019A0816).

Author information

Corresponding author

Correspondence to Mingxue Jiang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Jiang, M., Yang, Y. & Qiu, H. Fuzzy entropy and fuzzy support-based boosting random forests for imbalanced data. Appl Intell 52, 4126–4143 (2022). https://doi.org/10.1007/s10489-021-02620-y
