Abstract
Type 2 diabetes is a lifelong disease that causes a substantial increase in blood sugar (glucose). Today, type 2 diabetes is a major worldwide public health challenge, so it is necessary to automate the prediction of the disease. This study uses the “PIMA Indians Diabetes Data Set”, which is imbalanced. Consequently, the authors randomly selected 268 cases from each class to create a new balanced dataset. The objective is to analyse the impact of imbalanced data on the prediction of type 2 diabetes. Four machine learning methods (neural network, k-nearest neighbors, logistic regression, and AdaBoost) were applied to the original and balanced datasets with 10-fold cross-validation. Detailed information concerning the proposed model’s parameters is presented. The results support the use of neural networks for predicting type 2 diabetes: this method achieves an accuracy of 71.4% on the original dataset and 82.3% on the balanced dataset. Furthermore, the proposed method is compared with other studies available in the state of the art. On the balanced data, the neural network achieved 85.9% AUC, 82.2% F1-score, 82.6% precision, 82.3% recall (sensitivity), and 77.6% specificity.
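The balancing and evaluation procedure summarised in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses scikit-learn, a synthetic stand-in with roughly the same size and class ratio as the PIMA data (no real patient records), default hyperparameters, feature standardisation, and stratified folds. None of these choices are confirmed by the abstract. The helper names `undersample` and `evaluate` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def undersample(X, y, n_per_class, seed=0):
    """Randomly keep n_per_class rows of each class, without replacement."""
    rng = np.random.default_rng(seed)
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == label), size=n_per_class, replace=False)
        for label in np.unique(y)
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]


def evaluate(model, X, y, folds=10, seed=0):
    """Pool out-of-fold predictions from k-fold CV and derive three metrics."""
    cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    pred = cross_val_predict(model, X, y, cv=cv)
    tn, fp, fn, tp = confusion_matrix(y, pred, labels=[0, 1]).ravel()
    return {
        "accuracy": (tp + tn) / len(y),
        "sensitivity": tp / (tp + fn),  # recall on the positive class
        "specificity": tn / (tn + fp),  # recall on the negative class
    }


# Synthetic stand-in (assumption): 768 rows, 8 features, about 65% negatives.
X, y = make_classification(n_samples=768, n_features=8,
                           weights=[500 / 768], random_state=0)
n_min = int(np.bincount(y).min())  # size of the minority class

# Default settings are illustrative, not the parameters reported in the paper.
models = {
    "Neural network": make_pipeline(
        StandardScaler(), MLPClassifier(max_iter=1000, random_state=0)),
    "kNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

if __name__ == "__main__":
    X_bal, y_bal = undersample(X, y, n_min)
    for name, model in models.items():
        for label, (Xd, yd) in {"original": (X, y),
                                "balanced": (X_bal, y_bal)}.items():
            scores = evaluate(model, Xd, yd)
            print(name, label, {k: round(v, 3) for k, v in scores.items()})
```

Reporting specificity alongside sensitivity is what makes the effect of class imbalance visible: on imbalanced data a classifier can reach a reasonable accuracy while performing poorly on the minority class.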
Acknowledgements
This work was supported by the Polytechnic of Coimbra (ESTGOH), which the authors thank for its continuous support.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Mesquita, F., Marques, G. (2021). Predicting Type 2 Diabetes Through Machine Learning: Performance Analysis in Balanced and Imbalanced Data. In: Elbiaze, H., Sabir, E., Falcone, F., Sadik, M., Lasaulce, S., Ben Othman, J. (eds) Ubiquitous Networking. UNet 2021. Lecture Notes in Computer Science, vol 12845. Springer, Cham. https://doi.org/10.1007/978-3-030-86356-2_22
DOI: https://doi.org/10.1007/978-3-030-86356-2_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86355-5
Online ISBN: 978-3-030-86356-2
eBook Packages: Computer Science, Computer Science (R0)