Addressing Classification on Highly Imbalanced Clinical Datasets

Fonseca, Alexandre Babilone; Martins-Jr, David Correa; Wicik, Zofia; Postula, Marek; Simões, Sérgio Nery

doi:10.1007/978-3-031-17531-2_9

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 13254)

Included in the following conference series:

International Conference on Computational Advances in Bio and Medical Sciences

Abstract

Over the last twenty years, machine learning has provided a wide range of frameworks and tools to improve data analysis in several fields. Classification, regression, clustering and dimensionality reduction techniques have been widely used in clinical studies to assist health professionals in screening, risk estimation, diagnosis and prognosis. Because prospective studies often require a long follow-up period and a large sample, many investigations rely on retrospective analysis to develop precise classifiers. However, biological data usually present a limited number of samples and an imbalanced class distribution, which degrades classification performance. These issues can be alleviated by balancing techniques, which increase the number of samples in the minority classes (oversampling) and/or decrease the number of samples in the majority classes (undersampling). In this work, we propose an original framework to assess several balancing techniques, combining them with three out-of-the-box classifiers. We applied these combinations to the AVOCADO clinical study, a database of patient information that includes cardiovascular death or survival. Our retrospective analysis of this database showed that, when training the algorithms to predict cardiovascular outcomes in both sexes, the best undersampling techniques were ENN, RENN and Near-Miss 3, while ADASYN and SMOTE were the best oversampling techniques. Among the classifiers, Random Forest and Logistic Regression (with the internal balancing parameter enabled) achieved the best results with both families of balancing techniques. Proper balancing techniques, combined with feature importance analysis, improved the identification of clinical patterns in the data, enabling the detection of high-risk patients. This approach can be used in personalized medicine to improve patient survival and recovery.
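
To make the two families of balancing techniques concrete, the following is a minimal sketch in plain Python, not the framework evaluated in the paper. It assumes a SMOTE-style interpolation between a minority sample and one of its nearest minority neighbours for oversampling, and simple random removal for undersampling (a stand-in, not ENN, RENN or Near-Miss). The function names, the toy data and the labels are hypothetical, and the oversampler assumes at least two minority samples.

```python
import math
import random
from collections import Counter


def smote_like_oversample(X, y, minority_label, k=5, seed=0):
    """Append synthetic minority samples until the minority class is as
    large as the largest class. Illustrative only (hypothetical helper)."""
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    counts = Counter(y)
    n_needed = max(counts.values()) - counts[minority_label]
    X_new, y_new = list(X), list(y)
    for _ in range(n_needed):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        X_new.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
        y_new.append(minority_label)
    return X_new, y_new


def random_undersample(X, y, seed=0):
    """Randomly drop samples so that every class has as many samples as
    the smallest class. Illustrative only (hypothetical helper)."""
    rng = random.Random(seed)
    n_keep = min(Counter(y).values())
    kept = []
    for label in sorted(set(y)):
        idx = [i for i, lab in enumerate(y) if lab == label]
        kept.extend(rng.sample(idx, n_keep))
    kept.sort()
    return [X[i] for i in kept], [y[i] for i in kept]


# Toy data (invented): 12 majority samples (label 0), 4 minority samples (label 1).
X = [(float(i), float(i % 3)) for i in range(12)] + [
    (20.0, 5.0), (21.0, 6.0), (22.0, 5.5), (20.5, 6.5)
]
y = [0] * 12 + [1] * 4

X_over, y_over = smote_like_oversample(X, y, minority_label=1, k=3)
X_under, y_under = random_undersample(X, y)
print(Counter(y_over))   # class counts after oversampling
print(Counter(y_under))  # class counts after undersampling
```

Oversampling keeps every original sample and adds interpolated minority points, whereas undersampling discards majority samples; on a small clinical dataset that trade-off is one reason to compare several techniques rather than fix one in advance.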

Supported by organizations IFES, CAPES and FAPESP (procs 2018/18560-6, 2018/21934-5) and EMPATHY trial ABM05/2020/1.1.

Notes

1. https://github.com/dmlc/xgboost.
2. In this work, we used the Scikit-Learn Python library (https://scikit-learn.org/) to perform the analyses: Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825–2830, 2011.
3. Regarding the parameter configuration of the over-sampling techniques, we set k_neighbors = 5 for SMOTE and BorderlineSMOTE, and n_neighbors = 5 for ADASYN. Concerning the under-sampling techniques, we set n_neighbors = 3 for NearMiss versions 1, 2 and 3, and n_neighbors_ver3 = 20 (for version 3 only). For ENN and RENN we set n_neighbors = 3 and, for ClusterCentroids, voting = ‘auto’.
4. The platelet graph shows better results with Near-Miss techniques. This indicates the need for an automatic framework to find the best combination for each dataset.
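
The settings in note 3 can be collected into a single configuration, sketched below. The parameter names in the note match those of the imbalanced-learn package (imblearn); that the authors used this package is an assumption here, since the notes name only Scikit-Learn. The table names and the `build_samplers` helper are hypothetical, and imblearn is imported only when the helper is called.

```python
import importlib

# Parameter values transcribed from note 3. The class names are the
# imbalanced-learn classes assumed to correspond to each technique.
OVER_SAMPLERS = {
    "SMOTE": ("SMOTE", {"k_neighbors": 5}),
    "BorderlineSMOTE": ("BorderlineSMOTE", {"k_neighbors": 5}),
    "ADASYN": ("ADASYN", {"n_neighbors": 5}),
}
UNDER_SAMPLERS = {
    "NearMiss-1": ("NearMiss", {"version": 1, "n_neighbors": 3}),
    "NearMiss-2": ("NearMiss", {"version": 2, "n_neighbors": 3}),
    "NearMiss-3": ("NearMiss", {"version": 3, "n_neighbors": 3,
                                "n_neighbors_ver3": 20}),
    "ENN": ("EditedNearestNeighbours", {"n_neighbors": 3}),
    "RENN": ("RepeatedEditedNearestNeighbours", {"n_neighbors": 3}),
    "ClusterCentroids": ("ClusterCentroids", {"voting": "auto"}),
}


def build_samplers():
    """Instantiate every configured sampler (requires imbalanced-learn)."""
    samplers = {}
    for module_name, table in (
        ("imblearn.over_sampling", OVER_SAMPLERS),
        ("imblearn.under_sampling", UNDER_SAMPLERS),
    ):
        module = importlib.import_module(module_name)
        for name, (class_name, params) in table.items():
            samplers[name] = getattr(module, class_name)(**params)
    return samplers
```

With imbalanced-learn installed, each object returned by `build_samplers()` exposes `fit_resample(X, y)`, so the same loop can pair every sampler with every classifier.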

References

Ahmed, Z., Mohamed, K., Zeeshan, S., Dong, X.: Artificial intelligence with multi-functional machine learning platform development for better healthcare and precision medicine. Database 2020, 1–35 (2020)
Estabrooks, A., Jo, T., Japkowicz, N.: A multiple resampling method for learning from imbalanced data sets. Comput. Intell. 20(1), 18–36 (2004)
Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)
Fehr, D., et al.: Automatic classification of prostate cancer Gleason scores from multiparametric magnetic resonance images. Proc. Natl. Acad. Sci. 112(46), E6265–E6273 (2015)
Fernández, A., García, S., Galar, M., Prati, R.C., Krawczyk, B., Herrera, F.: Learning from Imbalanced Data Sets. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-98074-4
He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)
Japkowicz, N., Stephen, S.: The class imbalance problem: a systematic study. Intell. Data Anal. 6(5), 429–449 (2002)
Krawczyk, B., Galar, M., Jeleń, Ł, Herrera, F.: Evolutionary undersampling boosting for imbalanced classification of breast cancer malignancy. Appl. Soft Comput. 38, 714–726 (2016)
Larrañaga, P., et al.: Machine learning in bioinformatics. Brief. Bioinform. 7(1), 86–112 (2006)
Min, S., Lee, B., Yoon, S.: Deep learning in bioinformatics. Brief. Bioinform. (March) bbw068 (2016)
Mohedano-Munoz, M., Alique-García, S., Rubio-Sánchez, M., Raya, L., Sanchez, A.: Interactive visual clustering and classification based on dimensionality reduction mappings: a case study for analyzing patients with dermatologic conditions. Expert Syst. Appl. 171(2019), 114605 (2021)
Rosiak, M., et al.: Effect of ASA dose doubling versus switching to clopidogrel on plasma inflammatory markers concentration in patients with type 2 diabetes and high platelet reactivity: the AVOCADO study. Cardiol. J. 20(5), 545–551 (2013)
Sabatino, J., et al.: MicroRNAs fingerprint of bicuspid aortic valve. J. Mol. Cellular Cardiol. 134(July), 98–106 (2019)
Oh, S., Lee, M.S., Zhang, B.-T.: Ensemble learning with active example selection for imbalanced biomedical data classification. IEEE/ACM Trans. Comput. Biol. Bioinform. 8(2), 316–325 (2011)
Shah, P., et al.: Artificial intelligence and machine learning in clinical development: a translational perspective. NPJ Digit. Med. 2(1), 69 (2019)
Van Hulse, J., Khoshgoftaar, T.M., Napolitano, A.: Experimental perspectives on learning from imbalanced data. In: Proceedings of the 24th International Conference on Machine Learning - ICML 2007, vol. 227, pp. 935–942. ACM Press, New York (2007)
Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998)

Author information

Authors and Affiliations

Federal Institute of Espírito Santo (IFES), Serra, ES, Brazil
Alexandre Babilone Fonseca & Sérgio Nery Simões
Federal University of ABC (UFABC), Santo André, Brazil
David Correa Martins-Jr & Zofia Wicik
Department of Experimental and Clinical Pharmacology, Center for Preclinical Research and Technology CEPT, Medical University of Warsaw, Warsaw, Poland
Zofia Wicik & Marek Postula

Corresponding author

Correspondence to Sérgio Nery Simões.

Editor information

Editors and Affiliations

University of Connecticut, Storrs, CT, USA
Mukul S. Bansal
University of Connecticut, Storrs, CT, USA
Ion Măndoiu
University of Connecticut Health Center, Farmington, CT, USA
Marmar Moussa
Georgia State University, Atlanta, GA, USA
Murray Patterson
University of Connecticut, Storrs, CT, USA
Sanguthevar Rajasekaran
Georgia State University, Atlanta, GA, USA
Pavel Skums
Georgia State University, Atlanta, GA, USA
Alexander Zelikovsky

About this paper

Cite this paper

Fonseca, A.B., Martins-Jr, D.C., Wicik, Z., Postula, M., Simões, S.N. (2022). Addressing Classification on Highly Imbalanced Clinical Datasets. In: Bansal, M.S., et al. Computational Advances in Bio and Medical Sciences. ICCABS 2021. Lecture Notes in Computer Science, vol 13254. Springer, Cham. https://doi.org/10.1007/978-3-031-17531-2_9

DOI: https://doi.org/10.1007/978-3-031-17531-2_9
Published: 19 October 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-17530-5
Online ISBN: 978-3-031-17531-2
eBook Packages: Computer Science, Computer Science (R0)
