Abstract
Missing values are common in real-world data. In this research we develop a new version of the well-known k-means clustering algorithm that handles such incomplete datasets. The k-means algorithm performs two basic steps at each iteration: it associates each point with its closest centroid, and then it computes the new centroids. Running it therefore requires a distance function and a formula for computing the mean. To measure the similarity between two incomplete points, we use the distribution of the incomplete attributes. We propose three approaches for computing the centroids. In the first, each incomplete point is treated as a single point, and the centroid is computed with a formula derived in this research. In the second and third, each incomplete point is replaced with a large number of points drawn according to the data distribution, and the centroid is computed from these points. Even so, the runtime complexity of the proposed k-means is the same as that of standard k-means over complete datasets. We experimented on six standard numerical datasets from different fields and compared the performance of the proposed k-means with that of other basic methods. Our experiments show that the proposed k-means algorithms outperform previously published methods.
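The two-step iteration described in the abstract can be illustrated with a minimal sketch. It is not the formula derived in the chapter, which the abstract does not state. It assumes that a missing coordinate is modelled as a random variable with the attribute's empirical mean and variance, so that its expected squared distance to a centroid coordinate c is (mu - c)^2 + var. The function names (attribute_stats, expected_sq_distance, kmeans_incomplete) and the toy data are hypothetical.

```python
import math


def attribute_stats(points):
    """Per-attribute mean and variance from observed (non-NaN) values only.

    Assumes every attribute has at least one observed value.
    """
    means, variances = [], []
    for j in range(len(points[0])):
        observed = [p[j] for p in points if not math.isnan(p[j])]
        mu = sum(observed) / len(observed)
        means.append(mu)
        variances.append(sum((v - mu) ** 2 for v in observed) / len(observed))
    return means, variances


def expected_sq_distance(point, centroid, means, variances):
    """Expected squared Euclidean distance from a possibly incomplete point.

    Assumption (not taken from the chapter): a missing coordinate contributes
    E[(X - c)^2] = (mu - c)^2 + var, using the attribute's mean and variance.
    """
    total = 0.0
    for x, c, mu, var in zip(point, centroid, means, variances):
        total += (mu - c) ** 2 + var if math.isnan(x) else (x - c) ** 2
    return total


def kmeans_incomplete(points, initial_centroids, iterations=20):
    """k-means over points that may contain NaN, from given complete centroids."""
    means, variances = attribute_stats(points)
    centroids = [list(c) for c in initial_centroids]
    k = len(centroids)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Step 1: associate each point with its closest centroid.
        labels = [
            min(range(k),
                key=lambda c: expected_sq_distance(p, centroids[c], means, variances))
            for p in points
        ]
        # Step 2: recompute each centroid. Minimising the summed expected
        # distance over c makes a missing coordinate contribute its attribute mean.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                continue  # leave the centroid of an empty cluster unchanged
            centroids[c] = [
                sum(mu if math.isnan(p[j]) else p[j] for p in members) / len(members)
                for j, mu in enumerate(means)
            ]
    return centroids, labels


# Toy data: two groups, each containing one point with a missing coordinate.
nan = math.nan
data = [[0.0, 0.1], [0.2, nan], [0.1, 0.0],
        [5.0, 5.1], [nan, 4.9], [5.2, 5.0]]
centroids, labels = kmeans_incomplete(data, initial_centroids=[[0.0, 0.0], [5.0, 5.0]])
print(labels)
print(centroids)
```

Under this simplifying assumption the variance term is the same for every centroid, so it shifts the cost of a point but never changes its assignment, and the sketch behaves like k-means after mean imputation. The initial centroids are passed in explicitly so that the run is deterministic.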
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
AbdAllah, L., Shimshoni, I. (2016). K-Means over Incomplete Datasets Using Mean Euclidean Distance. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2016. Lecture Notes in Computer Science, vol 9729. Springer, Cham. https://doi.org/10.1007/978-3-319-41920-6_9
DOI: https://doi.org/10.1007/978-3-319-41920-6_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41919-0
Online ISBN: 978-3-319-41920-6
eBook Packages: Computer Science, Computer Science (R0)