Comparing Statistical and Machine Learning Imputation Techniques in Breast Cancer Classification

  • Conference paper
Computational Science and Its Applications – ICCSA 2020 (ICCSA 2020)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12252)

Abstract

Missing data imputation is an important task when dealing with crucial data that cannot be discarded, such as medical data. This study evaluates and compares the impact of two statistical and two machine learning imputation techniques on the classification of breast cancer patients, using several evaluation metrics. Mean, Expectation-Maximization (EM), Support Vector Regression (SVR) and K-Nearest Neighbor (KNN) imputation were applied to 18% of data missing completely at random (MCAR) in the two Wisconsin datasets. We then empirically evaluated these four imputation techniques with five classifiers: decision tree (C4.5), Case-Based Reasoning (CBR), Random Forest (RF), Support Vector Machine (SVM) and Multi-Layer Perceptron (MLP). In total, 1380 experiments were conducted, and the findings confirmed that classification using machine learning-based imputation outperformed classification using statistical imputation. Moreover, our experiments showed that SVR was the best imputation method for breast cancer classification.
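To make the setup in the abstract concrete, the following is a minimal sketch of two of the four techniques (mean and KNN imputation) applied to values removed completely at random at an 18% rate. It is not the authors' implementation: the function names (`inject_mcar`, `mean_impute`, `knn_impute`), the choice of k = 3, the distance definition and the synthetic 50 × 9 table are all assumptions made for illustration, and the numbers are random values rather than the Wisconsin data. EM and SVR imputation are omitted for brevity.

```python
import math
import random


def inject_mcar(rows, rate, rng):
    """Copy `rows`, replacing each cell by None with probability `rate`.

    Every cell is dropped independently of its value, which is what
    "missing completely at random" (MCAR) means.
    """
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def mean_impute(rows):
    """Statistical baseline: fill each gap with its column's observed mean."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        # Assumption: fall back to 0.0 if a column has no observed value.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def _distance(a, b):
    """Root mean squared difference over columns observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf  # no common columns: the rows cannot be compared
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(rows, k=3):
    """Fill each gap with the mean of that column over the k nearest rows."""
    fallback = mean_impute(rows)  # used only when no comparable donor exists
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows that observe column j.
            donors = sorted(
                (_distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [val for dist, val in donors if dist != math.inf][:k]
            filled[i][j] = sum(nearest) / len(nearest) if nearest else fallback[i][j]
    return filled


# Synthetic stand-in data (hypothetical): 50 rows, 9 numeric features in [1, 10].
rng = random.Random(0)
complete = [[rng.uniform(1, 10) for _ in range(9)] for _ in range(50)]
incomplete = inject_mcar(complete, 0.18, rng)
mean_filled = mean_impute(incomplete)
knn_filled = knn_impute(incomplete, k=3)
```

Either filled table can then be passed to a classifier, which is the comparison the study carries out across its five classifiers. Because the synthetic features here are independent, this sketch says nothing about which method performs better; it only illustrates the mechanics.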

Corresponding author

Correspondence to Ali Idri.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Chlioui, I., Abnane, I., Idri, A. (2020). Comparing Statistical and Machine Learning Imputation Techniques in Breast Cancer Classification. In: Gervasi, O., et al. (eds.) Computational Science and Its Applications – ICCSA 2020. ICCSA 2020. Lecture Notes in Computer Science, vol. 12252. Springer, Cham. https://doi.org/10.1007/978-3-030-58811-3_5

  • DOI: https://doi.org/10.1007/978-3-030-58811-3_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58810-6

  • Online ISBN: 978-3-030-58811-3

  • eBook Packages: Computer Science (R0)
