Abstract
Multi-class imbalanced classification arises in many real-world applications. Its two major challenges are class imbalance, in which some classes occur far less frequently than others, and class overlap, in which samples from different classes fall in the same region of the feature space. Both complicate the classification task. Decomposition-based strategies are an effective way to improve performance on multi-class imbalanced classification, but existing studies based on this strategy have not addressed class imbalance and class overlap simultaneously. We therefore propose a method, clustering-based adaptive decomposition and editing-based diversified oversampling (CluAD-EdiDO), that addresses both problems. CluAD-EdiDO consists of two key components: clustering-based adaptive decomposition and an editing-based diversified oversampling technique. The former groups similar samples of the data set into clusters (i.e., “sub-problems”). The latter is applied independently within each cluster to combat imbalance and overlap, reducing the influence of the majority classes in overlapping regions and oversampling the minority classes appropriately. Furthermore, a diversified ensemble learning framework is adopted to select the best classification algorithm for each sub-problem. Extensive experiments on 17 real-world datasets demonstrate that our method outperforms the compared methods on multi-class imbalanced datasets.
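The per-cluster editing and oversampling step described above can be illustrated with a minimal sketch. It is not the authors' CluAD-EdiDO implementation: it assumes cluster assignments are already given, substitutes a Wilson-style nearest-neighbour editing rule for the editing step and simple SMOTE-like interpolation for the oversampling step, and all function names (`edit_majority`, `oversample`, `rebalance_per_cluster`) and the toy data are hypothetical.

```python
import math
import random
from collections import Counter

def knn(idx, X, k):
    """Indices of the k nearest neighbours of X[idx], excluding idx itself."""
    others = (j for j in range(len(X)) if j != idx)
    return sorted(others, key=lambda j: math.dist(X[idx], X[j]))[:k]

def edit_majority(X, y, k=3):
    """Assumed editing rule: drop a majority-class point when fewer than
    half of its k nearest neighbours share its label (overlap region)."""
    majority = Counter(y).most_common(1)[0][0]
    keep = []
    for i in range(len(X)):
        if y[i] == majority:
            neigh = knn(i, X, k)
            same = sum(1 for j in neigh if y[j] == majority)
            if same * 2 < len(neigh):
                continue  # mostly surrounded by other classes: remove
        keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]

def oversample(X, y, rng):
    """Assumed oversampling rule: interpolate between random same-class
    pairs until every class matches the largest class in the cluster."""
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        members = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            a, b = rng.choice(members), rng.choice(members)
            t = rng.random()
            X_out.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
            y_out.append(label)
    return X_out, y_out

def rebalance_per_cluster(X, y, cluster_ids, k=3, seed=0):
    """Apply editing, then oversampling, independently inside each cluster."""
    rng = random.Random(seed)
    result = {}
    for c in sorted(set(cluster_ids)):
        Xc = [x for x, cid in zip(X, cluster_ids) if cid == c]
        yc = [lab for lab, cid in zip(y, cluster_ids) if cid == c]
        Xc, yc = edit_majority(Xc, yc, k)
        result[c] = oversample(Xc, yc, rng)
    return result

# Toy data: cluster 0 holds majority class "A" with one point inside the "B"
# region; cluster 1 holds majority class "C" and two "B" points, no overlap.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5.1, 5.0),
     (5, 5), (5, 6), (6, 5),
     (10, 10), (10, 11), (11, 10), (11, 11),
     (20, 20), (20, 21)]
y = ["A"] * 6 + ["B"] * 3 + ["C"] * 4 + ["B"] * 2
cluster_ids = [0] * 9 + [1] * 6

balanced = rebalance_per_cluster(X, y, cluster_ids)
for c, (Xc, yc) in balanced.items():
    print(c, dict(Counter(yc)))
```

In this toy run the overlapping "A" point is edited out of cluster 0 before the "B" class is oversampled, so the synthetic samples are generated against a cleaned neighbourhood. In a fuller pipeline, a separate classifier would then be selected and trained for each cluster's rebalanced data.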
Cite this article
Chen, X., Zhang, L., Wei, X. et al. An effective method using clustering-based adaptive decomposition and editing-based diversified oversamping for multi-class imbalanced datasets. Appl Intell 51, 1918–1933 (2021). https://doi.org/10.1007/s10489-020-01883-1
DOI: https://doi.org/10.1007/s10489-020-01883-1