High-dimensional imbalanced biomedical data classification based on P-AdaBoost-PAUC algorithm

Abstract

High-dimensional imbalanced biomedical data combines two difficulties: high dimensionality and an imbalanced class distribution. Classification accuracy can be improved by selecting a low-dimensional feature subset that is highly correlated with the classification target and has minimal mutual redundancy. However, traditional feature selection algorithms tend to choose feature subsets that favor the class with the larger sample size, resulting in poor classification performance on minority samples. To address these problems, the P-AdaBoost-PAUC algorithm is proposed for high-dimensional imbalanced biomedical data classification. It makes two main contributions. First, an improved decision tree attribute optimization algorithm (DT-P) is proposed, which pays more attention to the correlation among attributes. Second, an improved AdaBoost algorithm based on probabilistic AUC (AdaBoost-PAUC) is proposed, which jointly considers misclassification probability and AUC so as to pay more attention to minority samples. Together they form an ensemble algorithm for high-dimensional imbalanced biomedical data classification that helps improve classification performance. Experimental results show that the Recall, Specificity, F1, and AUC values of the P-AdaBoost-PAUC ensemble algorithm reach the highest values on datasets with different imbalance rates. In particular, when the proportion of minority samples is only 12.6%, Recall, Specificity, F1, and AUC all exceed 0.95. Stability experiments further show that the P-AdaBoost-PAUC algorithm is more stable than the compared algorithms. Therefore, the P-AdaBoost-PAUC ensemble algorithm proposed in this paper improves the classification performance for minority samples on high-dimensional imbalanced biomedical data to a certain extent.
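
The exact formulations of DT-P and AdaBoost-PAUC are given only in the full text; the abstract states merely that the boosting stage combines misclassification probability with AUC so that minority samples receive more attention. The sketch below is therefore only illustrative of that general idea: it shows one plausible way an AdaBoost round weight could be scaled by a weight-aware AUC score. The class name `AUCWeightedAdaBoost` and the specific scaling `alpha = 0.5 * ln((1 - err) / err) * auc` are assumptions for illustration, not the authors' method.

```python
# Illustrative AUC-aware AdaBoost loop (hypothetical; not the paper's exact
# P-AdaBoost-PAUC formulation). Assumes a binary problem with labels in {0, 1},
# where 1 denotes the minority class.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score


class AUCWeightedAdaBoost:
    """AdaBoost variant whose round weight blends error rate and weighted AUC."""

    def __init__(self, n_rounds=50, max_depth=3):
        self.n_rounds = n_rounds
        self.max_depth = max_depth
        self.trees, self.alphas = [], []

    def fit(self, X, y):
        w = np.full(len(y), 1.0 / len(y))          # uniform sample weights
        for _ in range(self.n_rounds):
            tree = DecisionTreeClassifier(max_depth=self.max_depth)
            tree.fit(X, y, sample_weight=w)
            pred = tree.predict(X)
            # Weighted error rate, clipped to avoid division by zero.
            err = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)
            # Weighted AUC of this round's scores on the training distribution.
            auc = roc_auc_score(y, tree.predict_proba(X)[:, 1], sample_weight=w)
            # Round weight: classical AdaBoost term scaled by AUC, so rounds
            # that rank the minority class well gain extra influence
            # (an illustrative choice, not the published update rule).
            alpha = 0.5 * np.log((1 - err) / err) * auc
            w *= np.exp(alpha * (pred != y))       # up-weight misclassified samples
            w /= w.sum()
            self.trees.append(tree)
            self.alphas.append(alpha)
        return self

    def predict(self, X):
        votes = sum(a * (2 * t.predict(X) - 1) for a, t in zip(self.alphas, self.trees))
        return (votes > 0).astype(int)
```

In the pipeline described by the abstract, such a boosting stage would operate on the low-dimensional feature subset produced beforehand by the DT-P attribute optimization step.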

Acknowledgements

The authors are very indebted to the anonymous referees for their critical comments and suggestions for the improvement of this paper. This work was supported by a major project of the National Natural Science Foundation of China (51991365) and by the Natural Science Foundation of Shandong Province of China (ZR2021MF082).

Author information

Corresponding author

Correspondence to Kewen Li.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Li, X., Li, K. High-dimensional imbalanced biomedical data classification based on P-AdaBoost-PAUC algorithm. J Supercomput 78, 16581–16604 (2022). https://doi.org/10.1007/s11227-022-04509-0
