
A partial order framework for incomplete data clustering

Published in Applied Intelligence

Abstract

We propose in this paper a partial order framework for clustering incomplete data. The paramount feature of this framework is that it is built over a partial order that can be leveraged to establish data similarity. We present the underlying theoretical foundations and study the convergence of clustering algorithms within this framework. In addition, we present a partial order-based clustering algorithm (POK-means) that illustrates how the K-means clustering algorithm can be embedded in our framework. The first contribution of our method is that, unlike methods based on imputation of the missing values, it makes no assumptions about the missing data. Another important contribution is that it alleviates the false dismissals caused by other interval-based similarity measures. The experimental results show that, although our method assumes no prior knowledge of the missing data, it is competitive in accuracy and performance with most published incomplete data clustering methods that rely on assumptions about the input data or on imputation (e.g., methods based on partial or interval kernel distances).




Author information


Corresponding author

Correspondence to Hamdi Yahyaoui.


Appendix A

Proposition 1

≼ is a partial order.

Proof

  • Reflexivity: \([a_{L},a_{R}] \preceq [a_{L},a_{R}]\) since \(a_{L} \leq a_{L}\) and \(a_{R} \leq a_{R}\).

  • Antisymmetry: Let us assume that \([a_{L},a_{R}] \preceq [a_{L}^{\prime},a_{R}^{\prime}]\) and \([a_{L}^{\prime},a_{R}^{\prime}] \preceq [a_{L},a_{R}]\). This means that \(a_{L} \leq a_{L}^{\prime}\) and \(a_{L}^{\prime} \leq a_{L}\); hence \(a_{L} = a_{L}^{\prime}\). By similar reasoning, we have \(a_{R} = a_{R}^{\prime}\). So \([a_{L},a_{R}] = [a_{L}^{\prime},a_{R}^{\prime}]\).

  • Transitivity: Let us assume that \([a_{L},a_{R}] \preceq [a_{L}^{\prime},a_{R}^{\prime}]\) and \([a_{L}^{\prime},a_{R}^{\prime}] \preceq [a_{L}^{\prime\prime},a_{R}^{\prime\prime}]\). This means that \(a_{L} \leq a_{L}^{\prime}\) and \(a_{L}^{\prime} \leq a_{L}^{\prime\prime}\), so \(a_{L} \leq a_{L}^{\prime\prime}\). By similar reasoning, we have \(a_{R} \leq a_{R}^{\prime\prime}\). Therefore, \([a_{L},a_{R}] \preceq [a_{L}^{\prime\prime},a_{R}^{\prime\prime}]\). □
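The componentwise nature of ≼ makes the three properties easy to check mechanically. A minimal Python sketch, assuming intervals are represented as `(left, right)` pairs (a convention of this illustration, not of the paper):

```python
def interval_leq(a, b):
    """[a_L, a_R] ≼ [b_L, b_R] iff a_L <= b_L and a_R <= b_R."""
    return a[0] <= b[0] and a[1] <= b[1]

# Reflexivity: every interval is below itself
assert interval_leq((1, 3), (1, 3))

# Antisymmetry: two distinct intervals cannot be mutually comparable
assert not (interval_leq((1, 3), (2, 5)) and interval_leq((2, 5), (1, 3)))

# Transitivity: (1,3) ≼ (2,5) and (2,5) ≼ (4,6) imply (1,3) ≼ (4,6)
assert interval_leq((1, 3), (2, 5)) and interval_leq((2, 5), (4, 6))
assert interval_leq((1, 3), (4, 6))

# ≼ is only partial: (1,5) and (2,3) are incomparable
assert not interval_leq((1, 5), (2, 3)) and not interval_leq((2, 3), (1, 5))
```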

Proposition 2

(\({{\mathscr{L}}_{\mathcal {I}}},\preceq \)) is a complete lattice.

Proof

Let \(S \subseteq {\mathscr{L}}_{\mathcal{I}}\) with \(S = \{[a_{1},b_{1}], [a_{2},b_{2}], \ldots, [a_{n},b_{n}]\}\). Then \(u = [\max\limits(a_{i}), \max\limits(b_{i})]\) is an upper bound of the elements of S. Let \([x,y]\) be another upper bound of S. From the definition of the partial order ≼, it follows that \(\max\limits(a_{i}) \leq x\) and \(\max\limits(b_{i}) \leq y\), that is, \([\max\limits(a_{i}), \max\limits(b_{i})] \preceq [x,y]\). Therefore, u is the least upper bound of S. By similar reasoning, S has a greatest lower bound \(l = [\min\limits(a_{i}), \min\limits(b_{i})]\). □
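The least upper bound and greatest lower bound constructed in the proof are simply componentwise maxima and minima. A sketch (the function names `join` and `meet` are illustrative):

```python
def join(intervals):
    """Least upper bound of a set of intervals: componentwise maximum."""
    return (max(a for a, _ in intervals), max(b for _, b in intervals))

def meet(intervals):
    """Greatest lower bound of a set of intervals: componentwise minimum."""
    return (min(a for a, _ in intervals), min(b for _, b in intervals))

S = [(1, 4), (2, 3), (0, 6)]
assert join(S) == (2, 6)   # every interval in S is ≼ (2, 6)
assert meet(S) == (0, 3)   # (0, 3) is ≼ every interval in S
```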

Theorem 1

Any approximation based on the weighted distance POWD satisfies the lower bounding constraint.

Proof

Let \(d_{i}\) be an approximation of the distance between two multidimensional data points x and y on the i-th feature. We assume that \(d_{i}\) belongs to the interval distance \([d_{\min} = {\sum}_{i} d_{\min}^{i}, d_{\max} = {\sum}_{i} d_{\max}^{i}]\) in POF. This means that \(d_{i} \leq d_{\max}^{i} \leq d_{\max}\). So \(w_{i} d_{i} \leq w_{i} d_{\max}\), and thus \({\sum}_{i} w_{i} d_{i} \leq ({\sum}_{i} w_{i}) d_{\max}\). Since, by the definition of POWD, \({\sum}_{i} w_{i} = 1\), we conclude that \({\sum}_{i} w_{i} d_{i} \leq d_{\max}\), i.e., POWD satisfies the lower bounding constraint. □
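The inequality chain above can be checked numerically. In the sketch below, the per-feature bounds, distances, and weights are made-up values; only the constraint \({\sum}_{i} w_{i} = 1\) comes from the definition of POWD as stated here:

```python
def powd(dists, weights):
    """Weighted per-feature distance; POWD requires the weights to sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * d for w, d in zip(weights, dists))

d_max_per_feature = [2.0, 3.0, 1.5]   # hypothetical upper bounds d_max^i
d_max = sum(d_max_per_feature)        # d_max = sum_i d_max^i = 6.5
d = [1.0, 2.5, 0.5]                   # approximations, each d_i <= d_max^i
w = [0.2, 0.5, 0.3]                   # weights summing to 1

# The weighted distance never exceeds d_max: the lower bounding constraint
assert powd(d, w) <= d_max
```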

Theorem 2

POK-Means in its strict version converges to a local minimum of the lattice (\({\mathscr{L}}_{\mathcal {I}},\preceq \)).

Proof

The convergence proof is based on the following facts:

  • (\({\mathscr{L}}_{\mathcal{I}},\preceq\)) is a complete lattice.

  • In each iteration, the SSE is minimized, so the sequence of iterations forms a decreasing chain in terms of SSE. Let \(S = \{I_{0}, I_{1}, \ldots, I_{n}\}\) be the set of visited configurations. S is finite since the number of possible configurations is finite. Since S is a finite subset of the complete lattice \(({\mathscr{L}}_{\mathcal{I}},\preceq)\), it has a greatest lower bound, to which the decreasing chain, and hence POK-Means, converges. □
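The termination argument is generic: an iteration whose objective strictly decreases over a finite state space must reach a fixed point. A toy sketch of this pattern (the step function and objective below are placeholders, not POK-Means itself):

```python
def iterate(step, objective, state):
    """Repeat an objective-decreasing step until no strict decrease occurs.

    If the state space is finite and each accepted step strictly lowers
    the objective, the loop must terminate at a local minimum.
    """
    best = objective(state)
    while True:
        nxt = step(state)
        if objective(nxt) >= best:   # no strict decrease: fixed point reached
            return state
        state, best = nxt, objective(nxt)

# Toy instance: integer halving strictly decreases the value until 0
assert iterate(lambda s: s // 2, lambda s: s, 40) == 0
```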


Cite this article

Yahyaoui, H., AboElfotoh, H. & Shu, Y. A partial order framework for incomplete data clustering. Appl Intell 53, 7439–7454 (2023). https://doi.org/10.1007/s10489-022-03887-5
