Abstract
Diabetes is one of the most dangerous diseases in modern society, and prevention is an extremely important part of medicine. As artificial intelligence and healthcare become increasingly integrated, using machine learning models to predict and diagnose diabetes has become a major trend. To assess the advantages and potential of the XGBoost model for diabetes prediction, this study processed a medical examination dataset of 556,495 samples and identified 10 key features. Among these, glycated hemoglobin has high clinical value as a predictor. We built six machine learning models (XGBoost, Decision Tree, Logistic Regression, Random Forest, CatBoost, and LightGBM) and compared their performance. XGBoost performed best overall, with an accuracy of 97.5%, a recall of 97%, an F1 score of 96.9%, and a ROC-AUC of 0.971.
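For readers unfamiliar with the four reported metrics, the following is a minimal sketch of how accuracy, recall, F1 score, and ROC-AUC are commonly computed for a binary classifier. It is not the authors' evaluation code. The function names (`confusion_counts`, `classification_metrics`, `roc_auc`) are hypothetical, and the sketch assumes labels are encoded as 1 for diabetic and 0 for non-diabetic.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels, assuming 1 = diabetic."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, recall, precision, and F1 from hard predictions."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the probability that a random positive case is scored
    above a random negative case; ties count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise ROC-AUC above is quadratic in the number of samples, so it is suitable only for illustration; on a dataset of the size described here, a rank-based implementation from an established library would be the practical choice.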
Data availability
Data will be made available on reasonable request.
Acknowledgements
This study was supported by the 2022 Shanghai “Science and Technology Innovation Action Plan” Biomedical Science and Technology Support Special Project (Grant No. 22S31904600).
Author information
Contributions
(1) Conception and design: Qi Sun, Xin Cheng; (2) Administrative support: Ping Li, He Ren, Linhui Li; (3) Provision of research materials or patients: Zhaoli Zhou, Ping Li, He Ren; (4) Data collection and compilation: all authors; (5) Data analysis and interpretation: Kuo Han, Wenlong Zhao; (6) Manuscript writing: all authors; (7) Final approval of manuscript: all authors.
Ethics declarations
Ethics approval and consent to participate
This article does not contain any studies with human participants or animals.
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Competing interests
The authors have no competing interests to declare that are relevant to the content of this article.
About this article
Cite this article
Sun, Q., Cheng, X., Han, K. et al. Machine learning-based assessment of diabetes risk. Appl Intell 55, 106 (2025). https://doi.org/10.1007/s10489-024-05912-1
DOI: https://doi.org/10.1007/s10489-024-05912-1