Abstract
Class imbalance is a common problem in classification tasks. Most classification algorithms are trained to optimize overall accuracy, so they tend to neglect important but rarely occurring examples. The Mahalanobis–Taguchi system (MTS) has been shown to be robust to class imbalance owing to the inherent properties of its classification model construction. Bagging has often been applied as an effective strategy for reducing the learning bias of classification algorithms. In this study, we propose MTSbag, which integrates MTS with bagging-based ensemble learning to enhance the ability of conventional MTS to handle imbalanced data. Numerical experiments on multiple datasets with various class imbalance levels demonstrate the effectiveness of MTSbag, especially for datasets with high imbalance levels. Finally, as a healthcare application, an early warning system for in-hospital cardiac arrest was successfully implemented by leveraging the minority class identification ability of MTSbag.
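The general idea of combining a Mahalanobis-distance classifier with bagging can be illustrated with a minimal Python sketch. This is not the MTSbag algorithm from the paper: the function names, the threshold rule (midpoint between the class medians of the scaled distance), the number of bootstrap replicates, and the synthetic data are all assumptions made for illustration, and the feature-selection stage of MTS is omitted entirely.

```python
import numpy as np

def fit_md_model(X_normal):
    """Build a Mahalanobis space from majority-class ("normal") samples."""
    mean = X_normal.mean(axis=0)
    std = X_normal.std(axis=0, ddof=1)
    Z = (X_normal - mean) / std
    inv_corr = np.linalg.pinv(np.corrcoef(Z, rowvar=False))
    return mean, std, inv_corr

def scaled_md(X, model):
    """Scaled Mahalanobis distance: (1/k) * z^T C^{-1} z for each row of X."""
    mean, std, inv_corr = model
    Z = (X - mean) / std
    return np.einsum("ij,jk,ik->i", Z, inv_corr, Z) / Z.shape[1]

def fit_bagged_md(X, y, n_estimators=25, seed=0):
    """Fit one distance model per bootstrap replicate (y: 0 = majority, 1 = minority)."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    members = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)  # sample n rows with replacement
        Xb, yb = X[idx], y[idx]
        # Skip replicates that cannot support a model or a threshold.
        if (yb == 1).sum() == 0 or (yb == 0).sum() <= k:
            continue
        model = fit_md_model(Xb[yb == 0])
        md = scaled_md(Xb, model)
        # Assumed threshold rule: midpoint between the two class medians.
        threshold = 0.5 * (np.median(md[yb == 0]) + np.median(md[yb == 1]))
        members.append((model, threshold))
    return members

def predict_bagged_md(X, members):
    """Majority vote over members; 1 means 'flagged as minority class'."""
    votes = np.array([scaled_md(X, model) > thr for model, thr in members])
    return (votes.mean(axis=0) >= 0.5).astype(int)

# Synthetic imbalanced data (hypothetical): 200 majority vs. 10 minority samples.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 3)),
               rng.normal(6.0, 0.5, size=(10, 3))])
y = np.array([0] * 200 + [1] * 10)

members = fit_bagged_md(X, y)
pred = predict_bagged_md(np.array([[0.0, 0.0, 0.0], [6.0, 6.0, 6.0]]), members)
```

Each ensemble member builds its reference space from the majority class only, so the minority class influences nothing but the threshold. That is one plausible reading of why a distance-based scheme is less sensitive to imbalance than accuracy-driven learners. How the paper actually resamples, selects features, and sets thresholds should be taken from the article itself.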
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Hsiao, YH., Su, CT. & Fu, PC. Integrating MTS with bagging strategy for class imbalance problems. Int. J. Mach. Learn. & Cyber. 11, 1217–1230 (2020). https://doi.org/10.1007/s13042-019-01033-1
DOI: https://doi.org/10.1007/s13042-019-01033-1