Abstract
During the last twenty years, machine learning has provided a myriad of frameworks and tools to improve data analysis in several fields. Classification, regression, clustering and dimensionality reduction techniques have been widely used in clinical studies to assist health professionals in screening, risk estimation, diagnostics and prognostics. Prospective studies often involve a long follow-up period and a large sample; therefore, many investigations rely on retrospective analyses to develop precise classifiers. However, biological data usually present a limited number of samples and an imbalanced class distribution, which affects classification performance. These issues can be alleviated by employing balancing techniques, which increase the number of samples of the minority classes (oversampling) and/or decrease the number of samples of the majority classes (undersampling). In this work, we propose an original framework to assess several balancing techniques, combining them with three out-of-the-box classifiers. We applied these combinations to the AVOCADO clinical study, a database of patient information that includes cardiovascular death or survival. Our retrospective analysis of this database showed that, when training the algorithms to predict cardiovascular outcomes in both sexes, the best undersampling techniques were ENN, RENN and Near-Miss 3, while ADASYN and SMOTE were the best oversampling techniques. Regarding the classifier algorithms, Random Forest and Logistic Regression (with internal balancing parameter enabled) achieved the best results with both families of balancing techniques. Proper balancing techniques associated with feature importance analysis improved the identification of clinical patterns in the data, enabling detection of high-risk patients. This approach can be used in personalized medicine to improve patient survival and recovery.
Supported by organizations IFES, CAPES and FAPESP (procs 2018/18560-6, 2018/21934-5) and EMPATHY trial ABM05/2020/1.1.
Notes
- 1.
- 2. In this work, we used the scikit-learn Python library (https://scikit-learn.org/) to perform the analyses: Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825–2830, 2011.
- 3. Regarding the parameter configuration of the over-sampling techniques, we set k_neighbors = 5 for SMOTE and BorderlineSMOTE, and n_neighbors = 5 for ADASYN. For the under-sampling techniques, we set n_neighbors = 3 for NearMiss versions 1, 2 and 3, plus n_neighbors_ver3 = 20 for version 3 only. For ENN and RENN we set n_neighbors = 3 and, for ClusterCentroids, voting = ‘auto’.
- 4. The platelet graph shows better results with the Near-Miss techniques. This indicates the need for an automatic framework that finds the best combination for each dataset.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Fonseca, A.B., Martins-Jr, D.C., Wicik, Z., Postula, M., Simões, S.N. (2022). Addressing Classification on Highly Imbalanced Clinical Datasets. In: Bansal, M.S., et al. Computational Advances in Bio and Medical Sciences. ICCABS 2021. Lecture Notes in Computer Science, vol 13254. Springer, Cham. https://doi.org/10.1007/978-3-031-17531-2_9
DOI: https://doi.org/10.1007/978-3-031-17531-2_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-17530-5
Online ISBN: 978-3-031-17531-2
eBook Packages: Computer Science (R0)