Abstract
Clinical databases often contain noisy, inconsistent, missing, imbalanced, and high-dimensional data. These challenges may reduce the performance of data mining (DM) techniques. Data preprocessing is therefore an essential step in making medical datasets suitable for mining with DM algorithms. The objective of this work is to carry out a systematic mapping study that reviews the use of preprocessing techniques on clinical datasets. In total, 110 papers published between January 2000 and March 2017 were selected, analyzed, and classified according to publication year and channel, research type, and the preprocessing tasks used. The study shows that researchers have paid considerable attention to preprocessing in medical DM in the last decade, and that a significant number of the selected studies used data reduction and data cleaning tasks.
Acknowledgements
This research is part of the project PPR1/09: “mPHR in Morocco”, financed by the Ministry of Higher Education and Scientific Research in Morocco and CNRST, 2015-2017, and part of the GINSENG project (TIN2015-70259-C2-2-R), supported by the Spanish Ministry of Economy and Competitiveness and European FEDER funds.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Benhar, H., Idri, A., Fernández-Alemán, J.L. (2018). Data Preprocessing for Decision Making in Medical Informatics: Potential and Analysis. In: Rocha, Á., Adeli, H., Reis, L., Costanzo, S. (eds) Trends and Advances in Information Systems and Technologies. WorldCIST'18 2018. Advances in Intelligent Systems and Computing, vol 746. Springer, Cham. https://doi.org/10.1007/978-3-319-77712-2_116
DOI: https://doi.org/10.1007/978-3-319-77712-2_116
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77711-5
Online ISBN: 978-3-319-77712-2
eBook Packages: Engineering, Engineering (R0)