Abstract
This paper addresses learning from imbalanced data with ensemble learning. At present, the main approach is to combine under-sampling, over-sampling, or cost-sensitive learning with ensemble learning. However, these feature-space-based methods fail to reflect changes in the data distribution and are usually accompanied by high computational complexity and a risk of overfitting. In this paper, we propose a dynamic clustering algorithm based on the coefficient of variation (or entropy), which learns the local spatial distribution of the data and hierarchically clusters the majority class. The algorithm has low complexity and dynamically adjusts the clusters at each AdaBoost iteration, so that the clustering stays synchronized with changes in the sample weights. We then design an index that measures the importance of each cluster and, based on this index, propose a dynamic sampling algorithm based on maximum weight. Visual experiments demonstrate the effectiveness of the sampling algorithm. Finally, we propose a cost-sensitive algorithm based on Bagging and combine it with the dynamic sampling algorithm to obtain a multi-fusion imbalanced ensemble learning algorithm. In our experiments, the algorithms were validated on three artificial datasets, 22 KEEL datasets, and two gene expression cancer datasets, and showed ideal performance, or performance better than state-of-the-art (SOTA) methods, in terms of AUC. This indicates that the algorithms are not only effective for imbalanced learning but also offer potential for building a reliable biological cyber-physical system.
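The abstract names two ingredients without giving formulas: the coefficient of variation as a dispersion measure, and sampling by maximum weight within clusters of the majority class. The following is a minimal sketch of those two ideas only, not the algorithm proposed in the paper. It assumes the clusters and the boosting sample weights are already given; the function names `coefficient_of_variation`, `cluster_dispersion`, and `max_weight_per_cluster` are hypothetical, and the use of the population standard deviation is likewise an assumption.

```python
from collections import defaultdict
from math import sqrt


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (0.0 if the mean is 0)."""
    n = len(values)
    mean = sum(values) / n
    if mean == 0:
        return 0.0
    variance = sum((v - mean) ** 2 for v in values) / n
    return sqrt(variance) / mean


def cluster_dispersion(cluster_ids, weights):
    """Coefficient of variation of the sample weights inside each cluster."""
    groups = defaultdict(list)
    for cluster, weight in zip(cluster_ids, weights):
        groups[cluster].append(weight)
    return {c: coefficient_of_variation(ws) for c, ws in groups.items()}


def max_weight_per_cluster(cluster_ids, weights):
    """Index of the highest-weight sample in each cluster (first one wins ties)."""
    best = {}
    for i, (cluster, weight) in enumerate(zip(cluster_ids, weights)):
        if cluster not in best or weight > weights[best[cluster]]:
            best[cluster] = i
    return best


# Illustrative data (assumed, not from the paper): five majority-class samples
# in two clusters, with unnormalized sample weights.
cluster_ids = [0, 0, 0, 1, 1]
weights = [2.0, 2.0, 2.0, 1.0, 3.0]

print(cluster_dispersion(cluster_ids, weights))      # {0: 0.0, 1: 0.5}
print(max_weight_per_cluster(cluster_ids, weights))  # {0: 0, 1: 4}
```

In a boosting loop, the weights would change after every iteration, so both quantities would have to be recomputed each round; how the paper turns them into clusters and a cluster importance index is not specified in the abstract.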







Acknowledgements
This work was supported by the Fundamental Research Funds for the Central Universities of Central South University Grant Nos. 2019zzts588, XCX20190701588 and the National Natural Science Foundation of China under Grant Nos. 61772553, 61379058.
Cite this article
Deng, X., Xu, Y., Chen, L. et al. Dynamic clustering method for imbalanced learning based on AdaBoost. J Supercomput 76, 9716–9738 (2020). https://doi.org/10.1007/s11227-020-03211-3
DOI: https://doi.org/10.1007/s11227-020-03211-3