Comparing Statistical and Machine Learning Imputation Techniques in Breast Cancer Classification

Chlioui, Imane; Abnane, Ibtissam; Idri, Ali

doi:10.1007/978-3-030-58811-3_5

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12252)

Included in the following conference series:

International Conference on Computational Science and Its Applications

Abstract

Missing data imputation is an important task when dealing with crucial data that cannot be discarded, such as medical data. This study evaluates and compares the impact of two statistical and two machine learning imputation techniques on the classification of breast cancer patients, using several evaluation metrics. Mean, Expectation-Maximization (EM), Support Vector Regression (SVR) and K-Nearest Neighbor (KNN) imputation were applied to the two Wisconsin datasets with 18% of the data missing completely at random (MCAR). We then empirically evaluated these four imputation techniques with five classifiers: decision tree (C4.5), Case-Based Reasoning (CBR), Random Forest (RF), Support Vector Machine (SVM) and Multi-Layer Perceptron (MLP). In total, 1380 experiments were conducted, and the findings confirmed that classification using machine learning-based imputation outperformed classification using statistical imputation. Moreover, our experiments showed that SVR was the best imputation method for breast cancer classification.
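
To make the setup described in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It illustrates one statistical technique (mean imputation) and one machine learning technique (KNN imputation) applied to values removed completely at random. The function names (`inject_mcar`, `mean_impute`, `knn_impute`), the toy data, the distance definition and the choice of `k` are all assumptions made for illustration; EM and SVR imputation, the Wisconsin datasets and the five classifiers are not reproduced here.

```python
import math
import random


def inject_mcar(rows, rate, seed=0):
    """Copy `rows` and set round(rate * n_cells) cells to None, chosen
    uniformly at random (a simple missing-completely-at-random mask)."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    masked = [list(r) for r in rows]
    for i, j in rng.sample(cells, round(rate * len(cells))):
        masked[i][j] = None
    return masked


def mean_impute(rows):
    """Statistical baseline: fill each missing cell with its column mean
    (0.0 is an arbitrary fallback if a column has no observed value)."""
    means = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def knn_impute(rows, k=3):
    """Fill each missing cell with the mean of that column over the k nearest
    rows that observe it. Distance (an assumption of this sketch) is the root
    mean squared difference over features observed in both rows."""
    fallback = mean_impute(rows)
    out = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                shared = [(a - b) ** 2 for a, b in zip(row, other)
                          if a is not None and b is not None]
                if shared:
                    dist = math.sqrt(sum(shared) / len(shared))
                    candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [val for _, val in candidates[:k]]
            # Fall back to the column mean when no comparable neighbour exists.
            out[i][j] = sum(nearest) / len(nearest) if nearest else fallback[i][j]
    return out


# Toy data only (hypothetical values, not the Wisconsin datasets).
data = [[1.0, 2.0, 3.0], [1.1, 2.1, 3.1], [5.0, 6.0, 7.0],
        [5.1, 6.1, 7.1], [0.9, 1.9, 2.9]]
masked = inject_mcar(data, rate=0.18, seed=42)
filled_mean = mean_impute(masked)
filled_knn = knn_impute(masked, k=2)
```

In a comparison like the one the abstract describes, each completed dataset would then be passed to the same classifier so that differences in the evaluation metrics can be attributed to the imputation step.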

Author information

Authors and Affiliations

Software Project Management Research Team, ENSIAS, Mohammed V University in Rabat, Rabat, Morocco
Imane Chlioui, Ibtissam Abnane & Ali Idri
CSEHS-MSDA, Mohammed VI Polytechnic University, Ben Guerir, Morocco
Ali Idri

Corresponding author

Correspondence to Ali Idri.

Editor information

Editors and Affiliations

University of Perugia, Perugia, Italy
Osvaldo Gervasi
University of Basilicata, Potenza, Potenza, Italy
Beniamino Murgante
Chair- Center of ICT/ICE, Covenant University, Ota, Nigeria
Sanjay Misra
University of Cagliari, Cagliari, Italy
Chiara Garau
University of Cagliari, Cagliari, Italy
Ivan Blečić
Clayton School of Information Technology, Monash University, Clayton, VIC, Australia
David Taniar
Department of Information Science, Kyushu Sangyo University, Fukuoka, Japan
Bernady O. Apduhan
University of Minho, Braga, Portugal
Ana Maria A. C. Rocha
Polytechnic University of Bari, Bari, Italy
Eufemia Tarantino
Polytechnic University of Bari, Bari, Italy
Carmelo Maria Torre
Department of Neurology, University of Massachusetts Medical School, Worcester, MA, USA
Yeliz Karaca

About this paper

Cite this paper

Chlioui, I., Abnane, I., Idri, A. (2020). Comparing Statistical and Machine Learning Imputation Techniques in Breast Cancer Classification. In: Gervasi, O., et al. Computational Science and Its Applications – ICCSA 2020. ICCSA 2020. Lecture Notes in Computer Science, vol 12252. Springer, Cham. https://doi.org/10.1007/978-3-030-58811-3_5

DOI: https://doi.org/10.1007/978-3-030-58811-3_5
Published: 29 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58810-6
Online ISBN: 978-3-030-58811-3
eBook Packages: Computer Science, Computer Science (R0)
