Abstract
A major challenge in subspace clustering is that subspace clustering may generate an explosive number of clusters with high computational complexity, which severely restricts the usage of subspace clustering. The problem gets even worse with the increase of the data’s dimensionality. In this paper, we propose to summarize the set of subspace clusters into k representative clusters to alleviate the problem. Typically, subspace clusters can be clustered further into k groups, and the set of representative clusters can be selected from each group. In such a way, only the most representative subspace clusters will be returned to user. Unfortunately, when the size of the set of representative clusters is specified, the problem of finding the optimal set is NP-hard. To solve this problem efficiently, we present two approximate methods: PCoC and HCoC. The greatest advantage of our methods is that we only need a subset of subspace clusters as the input instead of the complete set of subspace clusters. Precisely, only the clusters in low-dimensional subspaces are computed and assembled into representative clusters in high-dimensional subspaces. The approximate results can be found in polynomial time. Our performance study shows both the effectiveness and efficiency of these methods.
Similar content being viewed by others
References
Afrati F, Gionis A, Mannila H (2004) Approximating a collection of frequent sets. In: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining (KDD’04), pp 12–19
Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of international conference on management of data (SIGMOD’98), pp 94–105
Assent I, Krieger R, Müller E, Seidl T (2007) DUSC: dimensionality unbiased subspace clustering. In: Proceedings of the seventh IEEE international conference on data mining (ICDM’07), pp 409–414
Baumgartner C, Kailing K, Kriegel H-P, Kroger P, Plant C (2004) Subspace selection for clustering high dimensional data. In: Proceedings of the fourth IEEE international conference on data mining (ICDM’04), pp 11–18
Bohm C, Kailing K, Kriegel H-P, Kroger P (2004) Density connected clustering with local subspace preferences. In: Proceedings of the fourth IEEE international conference on data mining (ICDM’04), pp 27–34
Chen G, Ma X, Yang D, Tang S (2008) Discovering the skyline of subspace clusters in high-dimensional data, In: Proceedings of the fifth international conference on fuzzy systems and knowledge discovery (FSKD ‘08), vol 2, pp 439–443
Cheng CH, Fu AC, Zhang Y (1999) Entropy-based subspace clustering for mining numerical data. In: Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining (KDD’99), pp 84–93
Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the second ACM SIGKDD international conference on knowledge discovery and data mining (KDD’96), pp 226–231
Jin R, Abu-Ata M, Xiang Y, Ruan N (2008) Effective and efficient itemset pattern summarization: regression-based approaches, In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’08), pp 399–407
Kriegel HP, Kroger P, Renz M, Wurst S (2005) A generic framework for efficient subspace clustering of high-dimensional data. In: Proceedings of the fifth IEEE international conference on data mining (ICDM’05)
Pasquier N, Bastide Y, Taouil R, Lakhal L (1999) Discovering frequent closed itemsets for association rules. In: Lecture notes in computer science: database theory, ICDT, pp 398–416
Wang C, Parthasarathy S (2006) Summarizing itemset patterns using probabilistic models. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’06), pp 730–735
Xin D, Han J, Yan X, Cheng H (2005) Mining compressed frequent-pattern sets. In: Proceedings of the 31st international conference on very large data bases (VLDB’05), pp 709–720
Yan X, Cheng H, Han J, Xin D (2005) Summarizing itemset patterns: a profile-based approach. In: Proceedings of the eleventh ACM SIGKDD international conference on knowledge discovery and data mining (KDD’05), pp 314–323
Acknowledgments
This work was supported by the National High-Tech Research and Development Plan of China (863) under Grant No. 2007AA120502; NSFC 60874082.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Chen, G., Ma, X., Yang, D. et al. Efficient approaches for summarizing subspace clusters into k representatives. Soft Comput 15, 845–853 (2011). https://doi.org/10.1007/s00500-010-0552-8
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00500-010-0552-8