Abstract
High-dimensional imbalanced biomedical data are characterized by both high dimensionality and an imbalanced class distribution. Classification accuracy can be improved by selecting low-dimensional feature subsets that are highly correlated with the classification target and minimally redundant with one another. However, traditional feature selection algorithms tend to select feature subsets that favor the class with the larger sample size, resulting in poor classification performance on minority samples. To address these problems, the P-AdaBoost-PAUC algorithm is proposed for classifying high-dimensional imbalanced biomedical data. The algorithm makes two major contributions. First, an improved decision tree attribute optimization algorithm (DT-P) is proposed, which pays more attention to the correlation among attributes. Second, an improved AdaBoost algorithm based on probabilistic AUC (AdaBoost-PAUC) is proposed, which considers both misclassification probability and AUC in order to pay more attention to minority samples. Together, these form an ensemble algorithm for high-dimensional imbalanced biomedical data classification that is conducive to improving classification performance. Experimental results show that the P-AdaBoost-PAUC ensemble algorithm achieves the highest Recall, Specificity, F1, and AUC values on datasets with different imbalance rates. In particular, when the proportion of minority samples is only 12.6%, the Recall, Specificity, F1, and AUC values all exceed 0.95. Algorithm stability experiments further show that P-AdaBoost-PAUC is more stable than the other algorithms. The P-AdaBoost-PAUC ensemble algorithm proposed in this paper therefore improves, to a certain extent, the classification performance on minority samples in high-dimensional imbalanced biomedical data.
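The four evaluation measures named in the abstract have standard definitions, sketched below for a binary problem in which the minority class is labeled 1. This is an illustrative sketch only, not the authors' evaluation code: the function names, the 0/1 label encoding, and the pairwise (Mann-Whitney) form of AUC are assumptions made for the example.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn), treating label 1 (assumed minority class) as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def recall(tp: int, fn: int) -> float:
    """Fraction of true minority samples that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn: int, fp: int) -> float:
    """Fraction of true majority samples that were correctly rejected."""
    return tn / (tn + fp) if (tn + fp) else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. This O(P * N) pairwise form is chosen for clarity,
    not for speed.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Recall and AUC are the measures most sensitive to how the minority class is handled: a classifier that predicts the majority class for every sample scores a Recall of 0 even though its plain accuracy can look high on an imbalanced dataset.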












Acknowledgements
The authors are deeply indebted to the anonymous referees for their critical comments and suggestions for improving this paper. This work was supported by the Major Project of the National Natural Science Foundation of China (51991365) and the Natural Science Foundation of Shandong Province of China (ZR2021MF082).
Cite this article
Li, X., Li, K. High-dimensional imbalanced biomedical data classification based on P-AdaBoost-PAUC algorithm. J Supercomput 78, 16581–16604 (2022). https://doi.org/10.1007/s11227-022-04509-0
DOI: https://doi.org/10.1007/s11227-022-04509-0