Abstract
K-Means is one of the most popular clustering algorithms. It aims to minimize the sum of pairwise distances within each cluster, and it has been widely used in data analysis, image recognition, and many other fields. However, traditional K-Means cannot handle missing values, which greatly limits its application scenarios. Missing values are ubiquitous in the real world due to sensor failure, high cost, and privacy protection. They cause useful information to be lost from the information system and make data mining difficult. Current improvements of K-Means for missing values are generally based on data completion or on the partial distance strategy. These methods achieve satisfactory performance when values are missing randomly, but they fail when data are missing not at random (MNAR). Considering the effect of the missing mechanism, this paper proposes an improved K-Means method for MNAR data that integrates the missing pattern into the distance measure to assist the clustering process. Experimental results on public datasets show that the proposed method outperforms both data completion-based K-Means and partial distance-based K-Means.
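To make the two ideas named in the abstract concrete, the following is a minimal sketch, not the paper's actual formulation. `partial_distance` shows one common form of the partial distance strategy (compare only observed coordinates, then rescale). `pattern_aware_distance` is a hypothetical illustration of how a missing pattern could enter the distance: the additive form, the weight `alpha`, and the per-cluster mean mask `c_mask` are all assumptions introduced here, since the abstract does not state the formula.

```python
import numpy as np


def partial_distance(x, c):
    """Squared partial distance between a sample x (NaN = missing) and a
    complete centroid c. Only observed coordinates of x contribute, and
    the sum is rescaled by (total dims / observed dims)."""
    observed = ~np.isnan(x)
    n_obs = observed.sum()
    if n_obs == 0:
        return np.inf  # no observed coordinate to compare
    diff = x[observed] - c[observed]
    return x.size / n_obs * float(diff @ diff)


def pattern_aware_distance(x, c, c_mask, alpha=1.0):
    """Hypothetical combination (an assumption, not the paper's formula):
    partial distance plus alpha times the squared difference between the
    sample's missing indicator and the cluster's mean missing indicator."""
    mask = np.isnan(x).astype(float)  # 1.0 where x is missing
    mask_diff = mask - c_mask
    return partial_distance(x, c) + alpha * float(mask_diff @ mask_diff)


x = np.array([1.0, np.nan, 3.0])
c = np.array([1.0, 5.0, 1.0])

# Same centroid values, but two clusters with different missing patterns.
same_pattern = np.array([0.0, 1.0, 0.0])   # cluster usually misses feature 2
other_pattern = np.array([0.0, 0.0, 0.0])  # cluster is fully observed

print(partial_distance(x, c))                       # 6.0
print(pattern_aware_distance(x, c, same_pattern))   # 6.0
print(pattern_aware_distance(x, c, other_pattern))  # 7.0
```

In this sketch the partial distance alone cannot separate the two candidate clusters, while the pattern term favors the cluster whose members tend to miss the same feature. That is the intuition behind using the missing pattern when missingness itself carries information, as it does under MNAR.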
Acknowledgements
This work was jointly supported by the National Natural Science Foundation of China (62136002, 61876027), and the Natural Science Foundation of Chongqing (cstc2022ycjh-bgzxm0004).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhou, R., Yu, H. (2022). MP-KMeans: K-Means with Missing Pattern for Data of Missing Not at Random. In: Yao, J., Fujita, H., Yue, X., Miao, D., Grzymala-Busse, J., Li, F. (eds) Rough Sets. IJCRS 2022. Lecture Notes in Computer Science, vol 13633. Springer, Cham. https://doi.org/10.1007/978-3-031-21244-4_18
DOI: https://doi.org/10.1007/978-3-031-21244-4_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21243-7
Online ISBN: 978-3-031-21244-4
eBook Packages: Computer Science, Computer Science (R0)