Abstract
In this article we present a new and efficient algorithm for handling missing values in databases used for data mining (DM). Missing values may harm the calculations of a clustering algorithm and lead to distorted results, so they must be treated before DM. Methods for handling missing values are commonly implemented as a process separate from the DM itself, which may cause long runtimes and redundant I/O accesses and, as a result, make the entire DM process inefficient. We present a new algorithm, km-Impute, which integrates clustering and imputation of missing values in a unified process. The algorithm was tested on real measurements of Red wine quality (from the UCI Machine Learning Repository). km-Impute succeeded in imputing missing values and building clusters in a single integrated process. The structure and quality of the clusters produced by km-Impute were similar to those of the clusters produced by k-means. In addition, the clusters were analyzed by a wine expert. The clusters represented different types of Red wine quality. The success and accuracy of the imputation were validated on two further datasets, White wine and Page blocks (also from the UCI repository). The results were consistent with those obtained on Red wine: the imputation success ratio was similar across all three datasets. Although the complexity of km-Impute is the same as that of k-means, in practice it was more efficient when applied to medium-sized databases: its runtime was significantly shorter than that of k-means, and it required fewer iterations to converge. km-Impute also performed far fewer I/O accesses than k-means.
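The abstract does not spell out the steps of km-Impute, so the following is only a minimal sketch of the general idea of imputing inside the clustering loop rather than in a separate preprocessing pass. It is not the authors' algorithm. The function name `kmeans_with_imputation`, its parameters, the column-mean starting values, and the use of NumPy are all assumptions made for illustration; the sketch also assumes every column has at least one observed value.

```python
import numpy as np

def kmeans_with_imputation(X, k, n_iter=100, seed=0):
    """Hypothetical sketch (not km-Impute): k-means that re-imputes
    missing cells (NaN) from the assigned centroid on every iteration."""
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    rng = np.random.default_rng(seed)

    # Assumption: start by filling missing cells with column means.
    filled = np.where(observed, X, np.nanmean(X, axis=0))

    # Initialise centroids from k distinct rows of the filled data.
    centroids = filled[rng.choice(len(filled), size=k, replace=False)].copy()
    labels = np.full(len(filled), -1)

    for _ in range(n_iter):
        # Assignment: squared Euclidean distance over observed features only.
        diff = filled[:, None, :] - centroids[None, :, :]
        diff = np.where(observed[:, None, :], diff, 0.0)
        new_labels = (diff ** 2).sum(axis=2).argmin(axis=1)

        # Imputation: missing cells take the value of the assigned centroid.
        filled = np.where(observed, X, centroids[new_labels])

        # Update: each centroid becomes the mean of its (filled) members.
        for j in range(k):
            members = filled[new_labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)

        converged = np.array_equal(new_labels, labels)
        labels = new_labels
        if converged:
            break

    # Final imputation from the final centroids.
    filled = np.where(observed, X, centroids[labels])
    return labels, centroids, filled


# Usage on a tiny made-up array (values are illustrative, not from the paper).
X_demo = np.array([[1.0, 1.0], [1.2, np.nan], [0.8, 1.1],
                   [10.0, 10.0], [10.2, 9.9], [np.nan, 10.1]])
labels, centroids, X_filled = kmeans_with_imputation(X_demo, k=2)
```

Because clustering and imputation share one loop, the data are scanned once per iteration for both tasks, which is the kind of saving in passes over the data that the abstract attributes to a unified process.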
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Ben Ishay, R., Herman, M. (2015). A Novel Algorithm for the Integration of the Imputation of Missing Values and Clustering. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2015. Lecture Notes in Computer Science, vol 9166. Springer, Cham. https://doi.org/10.1007/978-3-319-21024-7_8
DOI: https://doi.org/10.1007/978-3-319-21024-7_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-21023-0
Online ISBN: 978-3-319-21024-7
eBook Packages: Computer Science (R0)