Abstract
Datasets with skewed class distributions pose difficulties for pattern-classification learning algorithms. Most undersampling methods consider only the imbalance ratio and rarely account for the distribution of the original dataset. Moreover, many algorithms separate the resampling of imbalanced data from classifier training, which may discard important information and degrade classifier performance. To address these problems, this paper proposes a boosting random forest based on fuzzy entropy and fuzzy support (FESBoost). The algorithm comprises two parts: static undersampling and ensemble-classifier training. First, an attenuation function and the shared k-nearest-neighbor algorithm are used to construct a global class entropy, based on which the region containing the majority-class samples is divided into a safe area and a boundary area. Second, density peak clustering (DPCA) selects representative samples from the safe area; this constitutes the static resampling step. Finally, the classifier is trained within the boosting framework. Because the dataset remains imbalanced after static undersampling, the data are undersampled again before each iteration of the algorithm, based on the global class entropy and the average class support; the number of samples removed depends on the iteration count and the imbalance ratio. FESBoost thus combines static and dynamic resampling: static resampling reduces the imbalance ratio of the data, the overlap between classes, and the cost of classifier training, while dynamic resampling updates the majority-class samples according to the data distribution and each sample's likelihood of misclassification. The superiority of the proposed algorithm is verified experimentally on 9 synthetic datasets and 34 KEEL datasets. The proposed algorithm is also compared with seven algorithms, and the results show that it has better generalization performance than the compared algorithms.
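The static undersampling step described above can be sketched in a few lines. This is a minimal illustration, not the paper's method: the helper names are hypothetical, the local neighborhood entropy here is a simple Shannon-entropy proxy for the paper's fuzzy global class entropy, and the representative-sample selection uses random sampling where the paper uses density peak clustering.

```python
import math
import random

def local_entropy(x, same, other, k=5):
    """Proxy for the global class entropy: Shannon entropy of the
    class mix among the k nearest neighbours of x (illustrative only)."""
    neigh = sorted([(math.dist(x, s), 0) for s in same if s is not x] +
                   [(math.dist(x, o), 1) for o in other])[:k]
    p = sum(c for _, c in neigh) / k  # fraction of opposite-class neighbours
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def static_undersample(majority, minority, keep_safe=0.5, thresh=0.5, seed=0):
    """Split majority samples into a boundary area (high entropy) and a
    safe area (low entropy); keep all boundary points and a representative
    subset of the safe area (random here, DPCA clustering in the paper)."""
    rng = random.Random(seed)
    ent = {x: local_entropy(x, majority, minority) for x in majority}
    boundary = [x for x in majority if ent[x] >= thresh]
    safe = [x for x in majority if ent[x] < thresh]
    kept = rng.sample(safe, max(1, int(len(safe) * keep_safe)))
    return boundary + kept

# Toy 2-D data: a dense majority grid and a small minority cluster.
maj = [(rx / 10.0, ry / 10.0) for rx in range(10) for ry in range(5)]
mino = [(0.95, 0.45), (1.0, 0.5), (0.9, 0.4)]
reduced = static_undersample(maj, mino)
print(len(maj), "->", len(reduced))
```

Only majority samples whose neighborhoods mix with the minority class survive unconditionally; the interior of the majority region is thinned, which mirrors how the paper reduces both the imbalance ratio and the class overlap before boosting begins.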
Acknowledgements
This work was supported by the National Natural Science Foundation of China (grant 61573266) and the University Natural Science Research Key Projects of Anhui Province (KJ2019A0816).
Cite this article
Jiang, M., Yang, Y. & Qiu, H. Fuzzy entropy and fuzzy support-based boosting random forests for imbalanced data. Appl Intell 52, 4126–4143 (2022). https://doi.org/10.1007/s10489-021-02620-y