Application of Preprocessing Methods to Imbalanced Clinical Data: An Experimental Study

  • Conference paper

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 471)

Abstract

In this paper we describe an experimental study in which we analyzed data difficulty factors encountered in imbalanced clinical data sets and examined how well selected data preprocessing methods addressed these factors. We considered five data sets describing various pediatric acute conditions. In all of them the minority class was sparse and overlapped with the majority classes, which made it difficult to learn. We studied five preprocessing methods (random undersampling, random oversampling, SMOTE, the neighborhood cleaning rule and SPIDER2) combined with the following classifiers: k-nearest neighbors, decision trees and rules, naive Bayes, neural networks and support vector machines. Preprocessing always improved classification performance, and the largest improvement was observed for random undersampling. Moreover, naive Bayes was the best-performing classifier regardless of the preprocessing method used.
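
To make the simplest of the preprocessing methods named in the abstract concrete, the following is a minimal Python sketch of random undersampling, random oversampling and the interpolation step that SMOTE builds on. It is an illustration only, not the authors' implementation: the function names, the toy data and the choice to balance all classes to exactly equal sizes are assumptions made for this example, and the neighbor search that SMOTE performs is omitted.

```python
import random
from collections import Counter


def _group_by_class(X, y):
    """Collect feature vectors per class label (helper for this sketch)."""
    by_class = {}
    for features, label in zip(X, y):
        by_class.setdefault(label, []).append(features)
    return by_class


def random_undersample(X, y, seed=0):
    """Randomly drop examples until every class matches the smallest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(X, y)
    target = min(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        for features in rng.sample(rows, target):  # sample without replacement
            X_out.append(features)
            y_out.append(label)
    return X_out, y_out


def random_oversample(X, y, seed=0):
    """Randomly duplicate examples until every class matches the largest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(X, y)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for features in rows + extra:
            X_out.append(features)
            y_out.append(label)
    return X_out, y_out


def smote_like_point(a, b, rng):
    """Create a synthetic example on the segment between minority examples a and b.

    SMOTE picks b among the nearest minority neighbors of a; that search is
    left out here, so b has to be supplied by the caller.
    """
    gap = rng.random()  # interpolation factor in [0, 1)
    return [ai + gap * (bi - ai) for ai, bi in zip(a, b)]


# Hypothetical toy data: 8 majority examples (class 0), 2 minority examples (class 1).
X = [[float(i), float(i % 3)] for i in range(10)]
y = [0] * 8 + [1] * 2

X_under, y_under = random_undersample(X, y)
X_over, y_over = random_oversample(X, y)
print(Counter(y_under), Counter(y_over))
```

In an experiment of the kind described, resampling would be applied to the training portion only, so that test data keeps the original class distribution.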

Notes

  1. http://www.cs.waikato.ac.nz/ml/weka/

Acknowledgments

The first three authors would like to acknowledge support by the Polish National Science Center under Grant No. DEC-2013/11/B/ST6/00963.

Author information

Corresponding author

Correspondence to Szymon Wilk.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Wilk, S., Stefanowski, J., Wojciechowski, S., Farion, K.J., Michalowski, W. (2016). Application of Preprocessing Methods to Imbalanced Clinical Data: An Experimental Study. In: Piętka, E., Badura, P., Kawa, J., Wieclawek, W. (eds) Information Technologies in Medicine. ITiB 2016. Advances in Intelligent Systems and Computing, vol 471. Springer, Cham. https://doi.org/10.1007/978-3-319-39796-2_41

  • DOI: https://doi.org/10.1007/978-3-319-39796-2_41

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-39795-5

  • Online ISBN: 978-3-319-39796-2

  • eBook Packages: Engineering, Engineering (R0)
