Abstract
We propose a partial order framework for clustering incomplete data. The paramount feature of this framework is that it spans a partial order that can be leveraged to establish data similarity. We present the underlying theoretical foundations and study the convergence of clustering algorithms in this framework. In addition, we present a partial order-based clustering algorithm (POK-means) that illustrates how the K-means clustering algorithm can be embedded in our framework. The first contribution of our method is that, unlike methods based on imputation of the missing values, it makes no assumptions about the missing data. Another important contribution is that it alleviates the false dismissals caused by other interval-based similarity measures. The experimental results show that, although our method assumes no prior knowledge of (or assumptions about) the missing data, it is competitive in accuracy and performance with most published incomplete-data clustering methods that rely on assumptions about the input data or on imputation (e.g., methods based on partial or interval kernel distances).
Appendix A
Proposition 1
≼ is a partial order.
Proof
- Reflexivity: \([a_L, a_R] \preceq [a_L, a_R]\) since \(a_L \leq a_L\) and \(a_R \leq a_R\).
- Antisymmetry: Assume that \([a_L, a_R] \preceq [a'_L, a'_R]\) and \([a'_L, a'_R] \preceq [a_L, a_R]\). This means that \(a_L \leq a'_L\) and \(a'_L \leq a_L\); hence \(a_L = a'_L\). By similar reasoning, \(a_R = a'_R\). So \([a_L, a_R] = [a'_L, a'_R]\).
- Transitivity: Assume that \([a_L, a_R] \preceq [a'_L, a'_R]\) and \([a'_L, a'_R] \preceq [a''_L, a''_R]\). This means that \(a_L \leq a'_L\) and \(a'_L \leq a''_L\), so \(a_L \leq a''_L\). By similar reasoning, \(a_R \leq a''_R\). Therefore, \([a_L, a_R] \preceq [a''_L, a''_R]\).
□
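The componentwise order on intervals verified above can be sketched in a few lines of code. This is a minimal illustration, not part of the paper's implementation; the function name `leq` and the sample intervals are assumptions chosen for the example.

```python
# Interval order from Proposition 1:
# [aL, aR] ≼ [bL, bR]  iff  aL <= bL and aR <= bR (componentwise).
def leq(a, b):
    """Componentwise order on intervals a = (aL, aR), b = (bL, bR)."""
    return a[0] <= b[0] and a[1] <= b[1]

# Spot-check the three partial-order axioms on sample intervals.
x, y, z = (1, 4), (2, 5), (3, 7)
assert leq(x, x)                               # reflexivity
assert not (leq(x, y) and leq(y, x))           # antisymmetry (x != y)
assert leq(x, y) and leq(y, z) and leq(x, z)   # transitivity
```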
Proposition 2
(\({{\mathscr{L}}_{\mathcal {I}}},\preceq \)) is a complete lattice.
Proof
Let \(S \subseteq {\mathscr{L}}_{\mathcal {I}}\) with S = {[a1,b1], [a2,b2], …, [an,bn]}. Then \(u=[\max \limits (a_{i}), \max \limits (b_{i})]\) is an upper bound of the elements of S. Let [x,y] be another upper bound of S. From the partial order ≼, it follows that \(\max \limits (a_{i}) \leq x\) and \(\max \limits (b_{i}) \leq y\), which means that \([\max \limits (a_{i}), \max \limits (b_{i})] \preceq [x,y]\). Therefore, u is the least upper bound of S. By similar reasoning, S has a greatest lower bound \(l = [\min \limits (a_{i}), \min \limits (b_{i})]\). □
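The join and meet constructed in this proof are simply componentwise max and min. A minimal sketch (the names `join`, `meet`, and the sample set are assumptions for illustration):

```python
# Lattice operations from Proposition 2 on a set of intervals (aL, aR).
def join(intervals):
    """Least upper bound: [max aL, max aR]."""
    return (max(a for a, _ in intervals), max(b for _, b in intervals))

def meet(intervals):
    """Greatest lower bound: [min aL, min aR]."""
    return (min(a for a, _ in intervals), min(b for _, b in intervals))

S = [(1, 4), (2, 3), (0, 6)]
assert join(S) == (2, 6)   # componentwise max bounds every interval from above
assert meet(S) == (0, 3)   # componentwise min bounds every interval from below
```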
Theorem 1
Any approximation based on the weighted distance POWD satisfies the lower bounding constraint.
Proof
Let \(d_i\) be an approximation of the distance between two multidimensional data points x and y for the i-th feature. We assume that \(d_i\) lies within the interval distance \([d_{\min } = {\sum }_{i} d_{\min }^{i},\ d_{\max } = {\sum }_{i} d_{\max }^{i}]\) in POF, so \(d_i \leq d_{\max }\). Hence \(w_i d_i \leq w_i d_{\max }\), and summing over i gives \(\sum _i w_i d_i \leq (\sum _i w_i)\, d_{\max }\). Since, by the definition of POWD, \(\sum _i w_i = 1\), we obtain \(\sum _i w_i d_i \leq d_{\max }\), so POWD satisfies the lower bounding constraint. □
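The bounding step in this proof can be checked numerically. The sketch below assumes hypothetical per-feature distances and weights; `powd` stands in for the paper's weighted distance and is not its actual implementation.

```python
# Theorem 1 sketch: with weights summing to 1 and every d_i <= d_max,
# the weighted sum  sum(w_i * d_i)  never exceeds d_max.
def powd(weights, dists):
    """Weighted distance; requires weights summing to 1 (per POWD)."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * d for w, d in zip(weights, dists))

w = [0.2, 0.5, 0.3]       # hypothetical feature weights, sum to 1
d = [1.0, 2.5, 0.5]       # hypothetical per-feature distance approximations
d_max = max(d)            # upper end of the interval distance
assert powd(w, d) <= d_max
```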
Theorem 2
POK-Means in its strict version converges to a local minimum of the lattice (\({\mathscr{L}}_{\mathcal {I}},\preceq \)).
Proof
The convergence proof rests on the following facts:
- (\({\mathscr{L}}_{\mathcal {I}},\preceq \)) is a complete lattice.
- In each iteration the SSE is minimized, so the sequence of iterations forms a decreasing chain in terms of SSE. Let S be the set {I0, I1, …, In} of configurations produced by the iterations. S is finite since the number of configurations is finite. Since S is a finite subset of \({\mathscr{L}}_{\mathcal {I}}\), it has a greatest lower bound (by Proposition 2), which the decreasing chain reaches after finitely many iterations. □
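The termination argument — a strictly decreasing SSE chain over a finite configuration set must bottom out — can be illustrated generically. The functions below are hypothetical stand-ins, not POK-means itself: `next_config` and `sse` abstract one update step and its objective value.

```python
# Generic form of the Theorem 2 argument: iterate while the objective
# strictly decreases; finiteness of the configuration set guarantees a stop.
def iterate_until_stable(config, next_config, sse):
    """Return the first configuration at which SSE stops strictly decreasing."""
    current_sse = sse(config)
    while True:
        candidate = next_config(config)
        if sse(candidate) >= current_sse:   # no strict decrease: local minimum
            return config
        config, current_sse = candidate, sse(candidate)

# Toy example: configurations are integers, SSE is the value itself,
# and each step decrements toward a floor of 2.
result = iterate_until_stable(5, lambda c: max(c - 1, 2), lambda c: c)
assert result == 2
```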
Cite this article
Yahyaoui, H., AboElfotoh, H. & Shu, Y. A partial order framework for incomplete data clustering. Appl Intell 53, 7439–7454 (2023). https://doi.org/10.1007/s10489-022-03887-5