K-Means over Incomplete Datasets Using Mean Euclidean Distance

AbdAllah, Loai; Shimshoni, Ilan

doi:10.1007/978-3-319-41920-6_9


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9729)

Included in the following conference series:

International Conference on Machine Learning and Data Mining in Pattern Recognition


Abstract

Missing values are common in real-world applications. In this research we developed a new version of the well-known k-means clustering algorithm that handles such incomplete datasets. The k-means algorithm has two basic steps, performed at each iteration: it associates each point with its closest centroid, and then it computes the new centroids. To run it, we therefore need a distance function and a formula for computing the mean. To measure the similarity between two incomplete points, we use the distribution of the incomplete attributes. We propose three ways of computing the centroids. In the first, each incomplete point is treated as a single point and the centroid is computed with a formula derived in this research. In the second and third, each incomplete point is replaced with a large number of points drawn according to the data distribution, and the centroid is computed from these points. Even so, the runtime complexity of the suggested k-means is the same as that of standard k-means over complete datasets. We experimented on six standard numerical datasets from different fields and compared the performance of our proposed k-means to other basic methods. Our experiments show that our suggested k-means algorithms outperform previously published methods.
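The two-step loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract does not give their distance or centroid formulas, so the sketch assumes that each missing attribute behaves like a random variable with the mean `mu` and variance `var` of the observed values of that attribute, and uses the identity E[(X - c)^2] = (c - mu)^2 + var for the distance to a centroid coordinate `c`. The function names (`attribute_stats`, `expected_sq_distance`, `update_centroid`, `kmeans_incomplete`) and the use of `None` to mark a missing value are hypothetical choices made for this illustration.

```python
import random

def attribute_stats(points):
    """Per-attribute (mean, population variance), computed from observed values only."""
    dims = len(points[0])
    stats = []
    for j in range(dims):
        observed = [p[j] for p in points if p[j] is not None]
        mu = sum(observed) / len(observed)
        var = sum((v - mu) ** 2 for v in observed) / len(observed)
        stats.append((mu, var))
    return stats

def expected_sq_distance(point, centroid, stats):
    """Expected squared distance from a possibly incomplete point to a complete centroid."""
    total = 0.0
    for x, c, (mu, var) in zip(point, centroid, stats):
        if x is None:
            # Assumption: missing value ~ (mu, var), so E[(X - c)^2] = (c - mu)^2 + var.
            total += (c - mu) ** 2 + var
        else:
            total += (x - c) ** 2
    return total

def update_centroid(members, stats):
    """Minimizer of the summed expected squared distance under the assumption above:
    each missing value contributes the attribute mean mu."""
    return [
        sum(stats[j][0] if p[j] is None else p[j] for p in members) / len(members)
        for j in range(len(stats))
    ]

def kmeans_incomplete(points, k, iterations=20, seed=0):
    stats = attribute_stats(points)
    rng = random.Random(seed)
    # Start from k randomly chosen points, with gaps filled by attribute means.
    centroids = [update_centroid([p], stats) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Step 1: associate each point with its closest centroid.
        labels = [
            min(range(k), key=lambda i: expected_sq_distance(p, centroids[i], stats))
            for p in points
        ]
        # Step 2: compute the new centroids.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = update_centroid(members, stats)
    return labels, centroids
```

Under this simplified assumption the variance term is the same for every centroid, so it does not change which centroid is nearest, and the sketch behaves much like mean imputation. The paper's three centroid computations are presumably richer than this, so the sketch should be read only as a picture of where a distance function and a mean formula plug into k-means, not as a reproduction of the reported results.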




Author information

Authors and Affiliations

Department of Community Information Systems, Zefat Academic College; Department of Mathematics and Computer Science, The College of Sakhnin for Teacher Education, Sakhnin, Israel
Loai AbdAllah
Department of Information Systems, University of Haifa, Haifa, Israel
Ilan Shimshoni


Corresponding author

Correspondence to Loai AbdAllah.

Editor information

Editors and Affiliations

IBaI, Inst of Comp Vision and applied Comp Sci, Leipzig, Sachsen, Germany
Petra Perner


About this paper

Cite this paper

AbdAllah, L., Shimshoni, I. (2016). K-Means over Incomplete Datasets Using Mean Euclidean Distance. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2016. Lecture Notes in Computer Science (LNAI), vol 9729. Springer, Cham. https://doi.org/10.1007/978-3-319-41920-6_9


DOI: https://doi.org/10.1007/978-3-319-41920-6_9
Published: 28 June 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41919-0
Online ISBN: 978-3-319-41920-6
eBook Packages: Computer Science, Computer Science (R0)
