MP-KMeans: K-Means with Missing Pattern for Data of Missing Not at Random

  • Conference paper
Rough Sets (IJCRS 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13633)

Abstract

K-Means is one of the most popular clustering algorithms. It aims to minimize the sum of pairwise distances within each cluster, and it has been widely used in data analysis, image recognition, and many other fields. However, traditional K-Means cannot handle missing values, which greatly limits its application scenarios. Missing values are ubiquitous in the real world due to sensor failure, high cost, and privacy protection. Their presence causes useful information to be lost from the information system and makes data mining difficult. Currently, improvements of K-Means for missing values are generally based on data completion or a partial distance strategy. These methods achieve satisfactory performance with randomly missing values, but they fail when data are missing not at random (MNAR). Considering the effect of the missing mechanism, this paper proposes an improved version of traditional K-Means for MNAR data, which integrates the missing pattern into the distance measurement to assist the clustering process. Experimental results on public datasets show that the proposed method outperforms data completion-based K-Means and partial distance-based K-Means.
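
To make the idea of putting the missing pattern into the distance more concrete, the following is a minimal Python sketch. It is not the authors' formulation, which the abstract does not give. It assumes a common partial distance (squared differences over observed features, rescaled by the fraction of features observed) and adds a hypothetical pattern term: the squared distance between a sample's binary missingness mask and the per-cluster mean mask. The weight `lam`, the function names, and the update rules are all illustrative assumptions.

```python
import numpy as np

def pattern_aware_distances(X, M, centers, patterns, lam=1.0):
    """Distance of every row to every cluster: partial distance + lam * pattern term.

    X: (n, d) data with NaN for missing entries; M: (n, d) boolean mask, True = missing.
    centers: (k, d) value centroids; patterns: (k, d) mean missingness mask per cluster.
    The pattern term and `lam` are assumptions made for this sketch.
    """
    d = X.shape[1]
    obs = ~M                                    # True where a value is observed
    Xz = np.where(obs, X, 0.0)                  # zero-fill so NaN never propagates
    # Squared differences restricted to observed features, shape (n, k, d).
    diff2 = (Xz[:, None, :] - centers[None, :, :]) ** 2 * obs[:, None, :]
    n_obs = np.maximum(obs.sum(axis=1), 1)      # guard against fully missing rows
    partial = diff2.sum(axis=2) * (d / n_obs)[:, None]
    pattern = ((M[:, None, :].astype(float) - patterns[None, :, :]) ** 2).sum(axis=2)
    return partial + lam * pattern

def pattern_aware_kmeans(X, k, lam=1.0, n_iter=50, seed=0):
    """Lloyd-style iterations using the pattern-aware distance above."""
    rng = np.random.default_rng(seed)
    M = np.isnan(X)
    n = X.shape[0]
    col_means = np.nanmean(X, axis=0)           # assumes no column is entirely missing
    idx = rng.choice(n, size=k, replace=False)
    centers = np.where(M[idx], col_means, X[idx])   # fill gaps in the initial centroids
    patterns = M[idx].astype(float)
    labels = np.full(n, -1)
    for _ in range(n_iter):
        dist = pattern_aware_distances(X, M, centers, patterns, lam)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                               # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = labels == j
            if not members.any():
                continue                        # keep the old centroid of an empty cluster
            obs_j = ~M[members]
            counts = obs_j.sum(axis=0)
            sums = np.where(obs_j, X[members], 0.0).sum(axis=0)
            # Mean over observed values; fall back to the column mean if none observed.
            centers[j] = np.where(counts > 0, sums / np.maximum(counts, 1), col_means)
            patterns[j] = M[members].mean(axis=0)
    return labels, centers, patterns
```

With `lam=0` the sketch reduces to a plain partial distance K-Means. A larger `lam` lets samples that share a missingness pattern attract each other, which is the kind of signal that can matter when values are missing not at random.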

Notes

  1. http://archive.ics.uci.edu/ml/datasets/Soybean+%28Large%29

  2. https://archive.ics.uci.edu/ml/datasets/Iris

  3. https://archive.ics.uci.edu/ml/datasets/Glass+Identification

Acknowledgements

This work was jointly supported by the National Natural Science Foundation of China (62136002, 61876027), and the Natural Science Foundation of Chongqing (cstc2022ycjh-bgzxm0004).

Author information

Corresponding author

Correspondence to Hong Yu.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Zhou, R., Yu, H. (2022). MP-KMeans: K-Means with Missing Pattern for Data of Missing Not at Random. In: Yao, J., Fujita, H., Yue, X., Miao, D., Grzymala-Busse, J., Li, F. (eds) Rough Sets. IJCRS 2022. Lecture Notes in Computer Science, vol 13633. Springer, Cham. https://doi.org/10.1007/978-3-031-21244-4_18

  • DOI: https://doi.org/10.1007/978-3-031-21244-4_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-21243-7

  • Online ISBN: 978-3-031-21244-4

  • eBook Packages: Computer Science, Computer Science (R0)
