High-dimensional imbalanced biomedical data classification based on P-AdaBoost-PAUC algorithm

Li, Xiao; Li, Kewen

doi:10.1007/s11227-022-04509-0

Published: 08 May 2022

The Journal of Supercomputing, Volume 78, pages 16581–16604 (2022)

Abstract

High-dimensional imbalanced biomedical data are characterized by both high dimensionality and an imbalanced class distribution. Classification accuracy can be improved by selecting low-dimensional feature subsets that are highly correlated with the classification target and have minimal mutual redundancy. However, traditional feature selection algorithms tend to select feature subsets that favor the class with the larger sample size, resulting in poor classification performance on minority samples. To address these problems, the P-AdaBoost-PAUC algorithm is proposed for high-dimensional imbalanced biomedical data classification. The algorithm makes two major contributions. The first is an improved decision tree attribute optimization algorithm (DT-P), which pays more attention to the correlation among attributes. The second is an improved AdaBoost algorithm based on probabilistic AUC (AdaBoost-PAUC), which considers both misclassification probability and AUC in order to pay more attention to minority samples. Together they form an ensemble algorithm for high-dimensional imbalanced biomedical data classification, which helps improve classification performance. Experimental results show that the P-AdaBoost-PAUC ensemble algorithm achieves the highest Recall, Specificity, F1, and AUC values among the compared algorithms on datasets with different imbalance rates. In particular, when the proportion of minority samples is only 12.6%, Recall, Specificity, F1, and AUC all exceed 0.95. Stability experiments also show that the P-AdaBoost-PAUC algorithm is more stable than the other algorithms. Therefore, the P-AdaBoost-PAUC ensemble algorithm proposed in this paper improves, to a certain extent, the classification performance on minority samples in high-dimensional imbalanced biomedical data.
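The four measures reported in the abstract can be illustrated with a minimal sketch. It does not implement P-AdaBoost-PAUC, DT-P, or AdaBoost-PAUC; it only shows the standard definitions of Recall, Specificity, F1, and AUC on a small made-up example. The function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions introduced for illustration, with label 1 standing for the minority class.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn), treating label 1 as the minority (positive) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def recall(y_true, y_pred):
    """Fraction of minority samples that are correctly detected."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if tp + fn else 0.0


def specificity(y_true, y_pred):
    """Fraction of majority samples that are correctly rejected."""
    _, fp, tn, _ = confusion_counts(y_true, y_pred)
    return tn / (tn + fp) if tn + fp else 0.0


def f1(y_true, y_pred):
    """Harmonic mean of precision and recall for the minority class."""
    tp, fp, _, fn = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 2 minority samples out of 8.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.8, 0.3, 0.2, 0.1, 0.35, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

print(f"Recall      = {recall(y_true, y_pred):.4f}")
print(f"Specificity = {specificity(y_true, y_pred):.4f}")
print(f"F1          = {f1(y_true, y_pred):.4f}")
print(f"AUC         = {auc(y_true, scores):.4f}")
```

On this toy data, a classifier that always predicts the majority class would be correct on 6 of 8 samples yet have a Recall of 0, which is why measures tied to the minority class are reported rather than plain accuracy.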

Acknowledgements

The authors are indebted to the anonymous referees for their critical comments and suggestions for improving this paper. This work was supported by the Major Project of the National Natural Science Foundation of China (51991365) and the Natural Science Foundation of Shandong Province of China (ZR2021MF082).

Author information

Authors and Affiliations

College of Computer Science and Technology, China University of Petroleum Huadong, Qingdao, 266580, Shandong, China
Xiao Li & Kewen Li

Corresponding author

Correspondence to Kewen Li.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Li, X., Li, K. High-dimensional imbalanced biomedical data classification based on P-AdaBoost-PAUC algorithm. J Supercomput 78, 16581–16604 (2022). https://doi.org/10.1007/s11227-022-04509-0

Accepted: 06 April 2022
Published: 08 May 2022
Issue Date: September 2022
DOI: https://doi.org/10.1007/s11227-022-04509-0
