Predicting Type 2 Diabetes Through Machine Learning: Performance Analysis in Balanced and Imbalanced Data

Mesquita, Francisco; Marques, Gonçalo

doi:10.1007/978-3-030-86356-2_22

Part of the book series: Lecture Notes in Computer Science (LNCCN, volume 12845)

Included in the following conference series:

International Symposium on Ubiquitous Networking

Abstract

Type 2 diabetes is a lifelong disease that causes a substantial increase of sugar (glucose) in the blood, and it is now a major worldwide public health challenge. Automating the prediction of the disease is therefore necessary. This study uses the “PIMA Indians Diabetes Data Set”, which is imbalanced. Consequently, the authors have randomly selected 268 cases from each class to create a new balanced dataset. The objective is to analyse the impact of imbalanced data on the prediction of type 2 diabetes. Four machine learning methods, namely Neural Network, k-Nearest Neighbors, Logistic Regression, and AdaBoost, have been applied to both the original and the balanced dataset using 10-fold cross-validation. Detailed information concerning the parameters of the proposed models is presented. The results support the use of Neural Networks for predicting type 2 diabetes: this method achieves an accuracy of 71.4% on the original dataset and 82.3% on the balanced dataset. On the balanced data, Neural Networks also achieved 85.9% AUC, 82.2% F1-score, 82.6% precision, 82.3% recall (sensitivity), and 77.6% specificity. Furthermore, the proposed method has been compared with other studies available in the state of the art.
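
The balancing step and the reported metrics can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not say which tooling, random seed, or metric averaging scheme was used. The helper names `undersample` and `binary_metrics` are hypothetical, and the 500/268 class split used as placeholder labels is the commonly reported composition of the dataset, not a figure stated in the abstract.

```python
import random
from collections import Counter


def undersample(labels, per_class, seed=0):
    """Return sorted row indices holding `per_class` randomly chosen rows per label."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    chosen = []
    for label, indices in by_class.items():
        if len(indices) < per_class:
            raise ValueError(f"class {label!r} has only {len(indices)} rows")
        chosen.extend(rng.sample(indices, per_class))  # sampling without replacement
    return sorted(chosen)


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, specificity and F1 for one positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1

    def ratio(num, den):
        # Report 0.0 instead of failing when a denominator is empty.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # also called sensitivity
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Placeholder labels only (assumed 500 negative and 268 positive cases);
# real use would read the outcome column of the dataset instead.
labels = [0] * 500 + [1] * 268
balanced = undersample(labels, per_class=268)
print(Counter(labels[i] for i in balanced))
```

The indices in `balanced` would then select the rows passed to each classifier under 10-fold cross-validation. The sketch scores a single positive class; the abstract does not state whether its figures are per-class or averaged over both classes.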

Acknowledgements

This work was supported by the Polytechnic of Coimbra (ESTGOH). We thank the Polytechnic of Coimbra (ESTGOH) for its continuous support of this work.

Author information

Authors and Affiliations

Polytechnic of Coimbra, ESTGOH, Rua General Santos Costa, 3400-124, Oliveira do Hospital, Portugal
Francisco Mesquita & Gonçalo Marques

Authors

Francisco Mesquita
Gonçalo Marques

Corresponding author

Correspondence to Gonçalo Marques.

Editor information

Editors and Affiliations

University of Quebec at Montreal (UQAM), Montreal, QC, Canada
Halima Elbiaze
ENSEM, Hassan II University of Casablanca, Casablanca, Morocco
Essaid Sabir
Universidad Pública de Navarra (UPNA), Pamplona, Spain
Francisco Falcone
ENSEM, Hassan II University of Casablanca, Casablanca, Morocco
Mohamed Sadik
University of Lorraine, Nancy, France
Samson Lasaulce
Sorbonne University, Villetaneuse, France
Jalel Ben Othman

About this paper

Cite this paper

Mesquita, F., Marques, G. (2021). Predicting Type 2 Diabetes Through Machine Learning: Performance Analysis in Balanced and Imbalanced Data. In: Elbiaze, H., Sabir, E., Falcone, F., Sadik, M., Lasaulce, S., Ben Othman, J. (eds) Ubiquitous Networking. UNet 2021. Lecture Notes in Computer Science, vol 12845. Springer, Cham. https://doi.org/10.1007/978-3-030-86356-2_22

DOI: https://doi.org/10.1007/978-3-030-86356-2_22
Published: 12 December 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86355-5
Online ISBN: 978-3-030-86356-2
eBook Packages: Computer Science, Computer Science (R0)
