Abstract
Imbalanced datasets are a recurrent problem in health-care data processing. Most of the literature works with small academic datasets, so its results often fail to extrapolate to large real-life datasets or have little real-life validity. When generating minority class samples by interpolation is meaningless, undersampling the majority class becomes necessary to reach acceptable results. Ensembles of classifiers offer the advantage of diversity among their members, which may allow them to adapt to the imbalanced distribution. In this paper, we present a pipeline method that combines random undersampling with bootstrap aggregation (bagging) to train a hybrid ensemble of extreme learning machines and decision trees, whose diversity improves adaptation to the class-imbalanced dataset. The approach is demonstrated on a realistic, highly imbalanced dataset of emergency department patients from a Chilean hospital, with the goal of predicting patient readmission. Computational experiments show that our approach outperforms other well-known classification algorithms.
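The combination of random undersampling and bagging described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `TinyELM` class, the helper names, the ensemble size, the number of hidden units, the tree depth, the alternation between member types, the majority vote, and the synthetic data are all assumptions made for illustration. It assumes NumPy and scikit-learn are available.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


class TinyELM:
    """Hypothetical minimal extreme learning machine: random hidden layer,
    output weights obtained in closed form by least squares."""

    def __init__(self, n_hidden=20, rng=None):
        self.n_hidden = n_hidden
        self.rng = rng if rng is not None else np.random.default_rng()

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y):
        # Hidden weights are drawn at random and never trained.
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        # Map labels {0, 1} to targets {-1, +1}; pinv gives the least-squares fit.
        self.beta = np.linalg.pinv(self._hidden(X)) @ (2.0 * y - 1.0)
        return self

    def predict(self, X):
        return (self._hidden(X) @ self.beta > 0).astype(int)


def balanced_bootstrap(X, y, rng):
    """Undersample the majority class while bootstrapping: draw, with
    replacement, as many samples from each class as the smaller class has."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n = min(len(pos), len(neg))
    idx = np.concatenate([rng.choice(pos, n, replace=True),
                          rng.choice(neg, n, replace=True)])
    return X[idx], y[idx]


def fit_hybrid_ensemble(X, y, n_members=10, seed=0):
    rng = np.random.default_rng(seed)
    members = []
    for i in range(n_members):
        Xb, yb = balanced_bootstrap(X, y, rng)
        # Assumption: alternate ELMs and decision trees to mix member types.
        if i % 2 == 0:
            model = TinyELM(rng=rng)
        else:
            model = DecisionTreeClassifier(max_depth=5, random_state=i)
        members.append(model.fit(Xb, yb))
    return members


def predict_majority(members, X):
    # Assumption: plain majority vote, ties counted as the positive class.
    votes = np.mean([m.predict(X) for m in members], axis=0)
    return (votes >= 0.5).astype(int)


# Synthetic 95:5 imbalanced data, used only to exercise the sketch.
data_rng = np.random.default_rng(42)
X = np.vstack([data_rng.normal(0.0, 1.0, size=(950, 4)),
               data_rng.normal(2.0, 1.0, size=(50, 4))])
y = np.array([0] * 950 + [1] * 50)

members = fit_hybrid_ensemble(X, y)
pred = predict_majority(members, X)
minority_recall = pred[y == 1].mean()
```

Because each member sees a different balanced resample, no single classifier is trained on the skewed class distribution, while the ensemble as a whole still draws on many different majority class samples.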
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Artetxe, A., Graña, M., Beristain, A. et al. Balanced training of a hybrid ensemble method for imbalanced datasets: a case of emergency department readmission prediction. Neural Comput & Applic 32, 5735–5744 (2020). https://doi.org/10.1007/s00521-017-3242-y
DOI: https://doi.org/10.1007/s00521-017-3242-y