Abstract
We propose a partial order framework for clustering incomplete data. The paramount feature of this framework is that it spans a partial order that can be leveraged to establish data similarity. We present the underlying theoretical foundations and study the convergence of clustering algorithms in this framework. In addition, we present a partial order-based clustering algorithm (POK-means) that illustrates how the K-means clustering algorithm can be embedded in our framework. The first contribution of our method is that, unlike methods based on imputation of the missing values, it makes no assumptions about the missing data. Another important contribution is that it alleviates the false dismissals caused by other interval-based similarity measures. The experimental results show that, although our method assumes no prior knowledge of (or assumptions about) the missing data, it is competitive in accuracy and performance with most published incomplete-data clustering methods that rely on assumptions about the input data or on imputation (e.g., methods based on partial or interval kernel distances).
Appendix A
Proposition 1
≼ is a partial order.
Proof
- Reflexivity: \([a_L, a_R] \preceq [a_L, a_R]\) since \(a_L \leq a_L\) and \(a_R \leq a_R\).
- Antisymmetry: Assume that \([a_L, a_R] \preceq [a'_L, a'_R]\) and \([a'_L, a'_R] \preceq [a_L, a_R]\). This means that \(a_L \leq a'_L\) and \(a'_L \leq a_L\); hence \(a_L = a'_L\). By similar reasoning, \(a_R = a'_R\). So \([a_L, a_R] = [a'_L, a'_R]\).
- Transitivity: Assume that \([a_L, a_R] \preceq [a'_L, a'_R]\) and \([a'_L, a'_R] \preceq [a''_L, a''_R]\). This means that \(a_L \leq a'_L\) and \(a'_L \leq a''_L\), so \(a_L \leq a''_L\). By similar reasoning, \(a_R \leq a''_R\). Therefore, \([a_L, a_R] \preceq [a''_L, a''_R]\).
□
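The componentwise order on intervals verified above can be sketched in a few lines of code. This is a minimal illustration, not part of the paper's implementation; the function name `leq` and the sample intervals are assumptions chosen for the example.

```python
# Interval order from Proposition 1:
# [aL, aR] ≼ [bL, bR]  iff  aL <= bL and aR <= bR (componentwise).
def leq(a, b):
    """Componentwise order on intervals a = (aL, aR), b = (bL, bR)."""
    return a[0] <= b[0] and a[1] <= b[1]

# Spot-check the three partial-order axioms on sample intervals.
x, y, z = (1, 4), (2, 5), (3, 7)
assert leq(x, x)                               # reflexivity
assert not (leq(x, y) and leq(y, x))           # antisymmetry (x != y)
assert leq(x, y) and leq(y, z) and leq(x, z)   # transitivity
```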
Proposition 2
(\({{\mathscr{L}}_{\mathcal {I}}},\preceq \)) is a complete lattice.
Proof
Let \(S \subseteq {\mathscr{L}}_{\mathcal {I}}\) with S = {[a1,b1], [a2,b2], …, [an,bn]}. Then \(u=[\max \limits (a_{i}), \max \limits (b_{i})]\) is an upper bound of the elements of S. Let [x,y] be another upper bound of S. From the partial order ≼, it follows that \(\max \limits (a_{i}) \leq x\) and \(\max \limits (b_{i}) \leq y\), which means that \([\max \limits (a_{i}), \max \limits (b_{i})] \preceq [x,y]\). Therefore, u is the least upper bound of S. By similar reasoning, S has a greatest lower bound \(l = [\min \limits (a_{i}), \min \limits (b_{i})]\). □
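The join and meet constructed in this proof are simply componentwise max and min. A minimal sketch (the names `join`, `meet`, and the sample set are assumptions for illustration):

```python
# Lattice operations from Proposition 2 on a set of intervals (aL, aR).
def join(intervals):
    """Least upper bound: [max aL, max aR]."""
    return (max(a for a, _ in intervals), max(b for _, b in intervals))

def meet(intervals):
    """Greatest lower bound: [min aL, min aR]."""
    return (min(a for a, _ in intervals), min(b for _, b in intervals))

S = [(1, 4), (2, 3), (0, 6)]
assert join(S) == (2, 6)   # componentwise max bounds every interval from above
assert meet(S) == (0, 3)   # componentwise min bounds every interval from below
```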
Theorem 1
Any approximation based on the weighted distance POWD satisfies the lower bounding constraint.
Proof
Let \(d_i\) be an approximation of the distance between two multidimensional data points x and y for the i-th feature. We assume that \(d_i\) lies within the interval distance \([d_{\min } = {\sum }_{i} d_{\min }^{i},\ d_{\max } = {\sum }_{i} d_{\max }^{i}]\) in POF, so \(d_i \leq d_{\max }\). Hence \(w_i d_i \leq w_i d_{\max }\), and summing over i gives \(\sum _i w_i d_i \leq (\sum _i w_i)\, d_{\max }\). Since, by the definition of POWD, \(\sum _i w_i = 1\), we obtain \(\sum _i w_i d_i \leq d_{\max }\), so POWD satisfies the lower bounding constraint. □
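The bounding step in this proof can be checked numerically. The sketch below assumes hypothetical per-feature distances and weights; `powd` stands in for the paper's weighted distance and is not its actual implementation.

```python
# Theorem 1 sketch: with weights summing to 1 and every d_i <= d_max,
# the weighted sum  sum(w_i * d_i)  never exceeds d_max.
def powd(weights, dists):
    """Weighted distance; requires weights summing to 1 (per POWD)."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * d for w, d in zip(weights, dists))

w = [0.2, 0.5, 0.3]       # hypothetical feature weights, sum to 1
d = [1.0, 2.5, 0.5]       # hypothetical per-feature distance approximations
d_max = max(d)            # upper end of the interval distance
assert powd(w, d) <= d_max
```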
Theorem 2
POK-Means in its strict version converges to a local minimum of the lattice (\({\mathscr{L}}_{\mathcal {I}},\preceq \)).
Proof
The convergence proof rests on the following facts:
- (\({\mathscr{L}}_{\mathcal {I}},\preceq \)) is a complete lattice.
- In each iteration the SSE is minimized, so the sequence of iterations forms a decreasing chain in terms of SSE. Let S be the set {I0, I1, …, In} of configurations produced by the iterations. S is finite since the number of configurations is finite. Since S is a finite subset of \({\mathscr{L}}_{\mathcal {I}}\), it has a greatest lower bound (by Proposition 2), which the decreasing chain reaches after finitely many iterations. □
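The termination argument — a strictly decreasing SSE chain over a finite configuration set must bottom out — can be illustrated generically. The functions below are hypothetical stand-ins, not POK-means itself: `next_config` and `sse` abstract one update step and its objective value.

```python
# Generic form of the Theorem 2 argument: iterate while the objective
# strictly decreases; finiteness of the configuration set guarantees a stop.
def iterate_until_stable(config, next_config, sse):
    """Return the first configuration at which SSE stops strictly decreasing."""
    current_sse = sse(config)
    while True:
        candidate = next_config(config)
        if sse(candidate) >= current_sse:   # no strict decrease: local minimum
            return config
        config, current_sse = candidate, sse(candidate)

# Toy example: configurations are integers, SSE is the value itself,
# and each step decrements toward a floor of 2.
result = iterate_until_stable(5, lambda c: max(c - 1, 2), lambda c: c)
assert result == 2
```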
Cite this article
Yahyaoui, H., AboElfotoh, H. & Shu, Y. A partial order framework for incomplete data clustering. Appl Intell 53, 7439–7454 (2023). https://doi.org/10.1007/s10489-022-03887-5