Addressing Classification on Highly Imbalanced Clinical Datasets

  • Conference paper
Computational Advances in Bio and Medical Sciences (ICCABS 2021)

Abstract

During the last twenty years, machine learning has provided a myriad of frameworks and tools to improve data analyses in several fields. Classification, regression, clustering and dimensionality reduction techniques have been widely used in clinical studies to assist health professionals in screening, risk estimation, diagnostics and prognostics. Prospective studies often require a long follow-up period and a large sample; therefore, many investigations rely on retrospective analysis to develop precise classifiers. However, biological data usually presents a limited number of samples and an imbalanced class distribution, which affects classification performance. These issues can be alleviated by employing balancing techniques, which increase the number of samples of the minority classes (oversampling) and/or decrease the number of samples of the majority classes (undersampling). In this work, we propose an original framework to assess several balancing techniques, combining them with three out-of-the-box classifiers. We applied the combination of techniques to the AVOCADO clinical study, which consists of a database of patient information including cardiovascular death or survival. Our results from the retrospective analysis of this database showed that, for training the algorithm to predict cardiovascular outcomes in both sexes, the best undersampling techniques were ENN, RENN and Near-Miss 3, while ADASYN and SMOTE were the best oversampling techniques. Regarding the classifier algorithms, Random Forest and Logistic Regression (with the internal balancing parameter enabled) achieved the best results with both families of balancing techniques. Proper balancing techniques combined with feature importance analysis improved the identification of clinical patterns in the data, enabling detection of high-risk patients. This approach can be used in personalized medicine to improve patient survival and recovery.

Supported by organizations IFES, CAPES and FAPESP (procs 2018/18560-6, 2018/21934-5) and EMPATHY trial ABM05/2020/1.1.
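
To make the two balancing families named in the abstract concrete, the following is a minimal, standard-library-only sketch and not the authors' implementation. It shows random undersampling of the majority class, a SMOTE-style oversampler that interpolates between a minority sample and one of its nearest same-class neighbours, and the "balanced" class-weight heuristic. All function names and the toy data are hypothetical, and it is an assumption that the "internal balancing parameter" mentioned in the abstract corresponds to scikit-learn's class_weight="balanced" option.

```python
import math
import random
from collections import Counter


def _group_by_class(X, y):
    """Collect feature rows per class label."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return groups


def random_undersample(X, y, seed=0):
    """Undersampling: randomly keep only as many rows per class as the
    smallest class has."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    target = min(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        for row in rng.sample(rows, target):
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


def smote_like_oversample(X, y, k=5, seed=0):
    """Oversampling: grow every smaller class to the size of the largest one
    by interpolating between a sample and one of its k nearest
    same-class neighbours (the core idea behind SMOTE)."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    target = max(len(rows) for rows in groups.values())
    X_out, y_out = list(X), list(y)
    for label, rows in groups.items():
        for _ in range(target - len(rows)):
            base = rng.choice(rows)
            others = [r for r in rows if r is not base]
            neighbours = sorted(others, key=lambda r: math.dist(base, r))[:k]
            # A class with a single sample has no neighbour: duplicate it.
            partner = rng.choice(neighbours) if neighbours else base
            gap = rng.random()  # position along the segment base -> partner
            X_out.append([b + gap * (p - b) for b, p in zip(base, partner)])
            y_out.append(label)
    return X_out, y_out


def balanced_class_weights(y):
    """Weights n_samples / (n_classes * class_count), the heuristic that
    scikit-learn documents for class_weight="balanced"."""
    counts = Counter(y)
    return {label: len(y) / (len(counts) * n) for label, n in counts.items()}


# Hypothetical toy data: 8 majority rows (class 0), 2 minority rows (class 1).
X_toy = [[float(i), float(i % 3)] for i in range(8)] + [[10.0, 10.0], [11.0, 12.0]]
y_toy = [0] * 8 + [1] * 2

X_under, y_under = random_undersample(X_toy, y_toy)
X_over, y_over = smote_like_oversample(X_toy, y_toy)
print(Counter(y_under), Counter(y_over), balanced_class_weights(y_toy))
```

Techniques such as ENN, RENN, Near-Miss and ADASYN choose which samples to remove or synthesize using neighbourhood information rather than at random, so this sketch only illustrates the general principle of each family.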

Notes

  1. https://github.com/dmlc/xgboost.

  2. In this work, we used the Scikit-Learn Python library (https://scikit-learn.org/) to perform the analyses: Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825–2830, 2011.

  3. Regarding the parameter configuration of the over-sampling techniques, we set k_neighbors = 5 for SMOTE and BorderlineSMOTE and n_neighbors = 5 for ADASYN. For the under-sampling techniques, we set n_neighbors = 3 for NearMiss versions 1, 2 and 3, with n_neighbors_ver3 = 20 for version 3 only. For ENN and RENN we set n_neighbors = 3 and, for ClusterCentroids, voting = ‘auto’.

  4. The platelet graph shows better results with Near-Miss techniques. This indicates the need for an automatic framework to find the best combination for each dataset.
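
The sampler settings listed in Note 3 can be collected in one place, for example as the configuration table sketched below. This is an illustrative layout and not the authors' code. The parameter names match those in the note, but the assumption that they refer to the imbalanced-learn package (modules imblearn.over_sampling and imblearn.under_sampling) is ours, and the names SAMPLER_CONFIG and build_sampler are hypothetical.

```python
from importlib import import_module

OVER = "imblearn.over_sampling"    # assumed package, not named in the note
UNDER = "imblearn.under_sampling"  # assumed package, not named in the note

# Display name -> (module, class, keyword arguments taken from Note 3).
SAMPLER_CONFIG = {
    "SMOTE": (OVER, "SMOTE", {"k_neighbors": 5}),
    "BorderlineSMOTE": (OVER, "BorderlineSMOTE", {"k_neighbors": 5}),
    "ADASYN": (OVER, "ADASYN", {"n_neighbors": 5}),
    "NearMiss-1": (UNDER, "NearMiss", {"version": 1, "n_neighbors": 3}),
    "NearMiss-2": (UNDER, "NearMiss", {"version": 2, "n_neighbors": 3}),
    # n_neighbors_ver3 is only set for version 3, as the note states.
    "NearMiss-3": (UNDER, "NearMiss",
                   {"version": 3, "n_neighbors": 3, "n_neighbors_ver3": 20}),
    "ENN": (UNDER, "EditedNearestNeighbours", {"n_neighbors": 3}),
    "RENN": (UNDER, "RepeatedEditedNearestNeighbours", {"n_neighbors": 3}),
    "ClusterCentroids": (UNDER, "ClusterCentroids", {"voting": "auto"}),
}


def build_sampler(name):
    """Instantiate one configured sampler.

    The import happens here, so the table above can be inspected even when
    imbalanced-learn is not installed.
    """
    module_name, class_name, kwargs = SAMPLER_CONFIG[name]
    sampler_class = getattr(import_module(module_name), class_name)
    return sampler_class(**kwargs)
```

With imbalanced-learn installed, a call such as build_sampler("SMOTE").fit_resample(X, y) would return the resampled features and labels. Keeping the settings in a single table makes it straightforward to loop over every sampler and classifier combination, which is the kind of automatic search that Note 4 calls for.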

Author information

Corresponding author

Correspondence to Sérgio Nery Simões.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Fonseca, A.B., Martins-Jr, D.C., Wicik, Z., Postula, M., Simões, S.N. (2022). Addressing Classification on Highly Imbalanced Clinical Datasets. In: Bansal, M.S., et al. Computational Advances in Bio and Medical Sciences. ICCABS 2021. Lecture Notes in Computer Science, vol 13254. Springer, Cham. https://doi.org/10.1007/978-3-031-17531-2_9

  • DOI: https://doi.org/10.1007/978-3-031-17531-2_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-17530-5

  • Online ISBN: 978-3-031-17531-2

  • eBook Packages: Computer Science (R0)
