Abstract
Datasets with skewed class distributions pose difficulties for pattern-classification learning algorithms. Most undersampling methods consider only the imbalance ratio and rarely account for the distribution of the original dataset. Moreover, many algorithms separate the resampling of imbalanced data from classifier training, which may discard important information and degrade classifier performance. To address these problems, this paper proposes a boosting random forest based on fuzzy entropy and fuzzy support (FESBoost). The algorithm comprises two parts: static undersampling and ensemble-classifier training. First, an attenuation function and the shared k-nearest-neighbor algorithm are used to construct a global class entropy, based on which the region containing the majority-class samples is divided into a safe area and a boundary area. Second, density peak clustering (DPCA) selects representative samples from the safe area; this constitutes the static resampling step. Finally, the classifier is trained within the boosting framework. Because the dataset remains imbalanced after static undersampling, the data are undersampled again before each iteration of the algorithm, based on the global class entropy and the average class support; the number of samples removed depends on the iteration count and the imbalance ratio. FESBoost thus combines static and dynamic resampling: static resampling reduces the imbalance ratio of the data, the overlap between classes, and the cost of classifier training, while dynamic resampling updates the majority-class samples according to the data distribution and each sample's likelihood of misclassification. The superiority of the proposed algorithm is verified experimentally on 9 synthetic datasets and 34 KEEL datasets. The proposed algorithm is also compared with seven algorithms, and the results show that it has better generalization performance than the compared algorithms.
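The static undersampling step described above can be sketched in a few lines. This is a minimal illustration, not the paper's method: the helper names are hypothetical, the local neighborhood entropy here is a simple Shannon-entropy proxy for the paper's fuzzy global class entropy, and the representative-sample selection uses random sampling where the paper uses density peak clustering.

```python
import math
import random

def local_entropy(x, same, other, k=5):
    """Proxy for the global class entropy: Shannon entropy of the
    class mix among the k nearest neighbours of x (illustrative only)."""
    neigh = sorted([(math.dist(x, s), 0) for s in same if s is not x] +
                   [(math.dist(x, o), 1) for o in other])[:k]
    p = sum(c for _, c in neigh) / k  # fraction of opposite-class neighbours
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def static_undersample(majority, minority, keep_safe=0.5, thresh=0.5, seed=0):
    """Split majority samples into a boundary area (high entropy) and a
    safe area (low entropy); keep all boundary points and a representative
    subset of the safe area (random here, DPCA clustering in the paper)."""
    rng = random.Random(seed)
    ent = {x: local_entropy(x, majority, minority) for x in majority}
    boundary = [x for x in majority if ent[x] >= thresh]
    safe = [x for x in majority if ent[x] < thresh]
    kept = rng.sample(safe, max(1, int(len(safe) * keep_safe)))
    return boundary + kept

# Toy 2-D data: a dense majority grid and a small minority cluster.
maj = [(rx / 10.0, ry / 10.0) for rx in range(10) for ry in range(5)]
mino = [(0.95, 0.45), (1.0, 0.5), (0.9, 0.4)]
reduced = static_undersample(maj, mino)
print(len(maj), "->", len(reduced))
```

Only majority samples whose neighborhoods mix with the minority class survive unconditionally; the interior of the majority region is thinned, which mirrors how the paper reduces both the imbalance ratio and the class overlap before boosting begins.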
Acknowledgements
This work was supported by the National Natural Science Foundation of China (grant 61573266) and the University Natural Science Research Key Projects of Anhui Province (KJ2019A0816).
Cite this article
Jiang, M., Yang, Y. & Qiu, H. Fuzzy entropy and fuzzy support-based boosting random forests for imbalanced data. Appl Intell 52, 4126–4143 (2022). https://doi.org/10.1007/s10489-021-02620-y