ABSTRACT
Subspace clustering mines clusters hidden in subspaces of high-dimensional data sets. In full-space clustering, density-based approaches have been shown to successfully mine clusters of arbitrary shape even in the presence of noise. Exhaustive search of all density-based subspace clusters, however, results in infeasible runtimes for large high-dimensional data sets: the number of possible subspace projections is exponential in the dimensionality, and density-based clustering itself is computationally expensive.
In this paper, we propose lossless and efficient detection of density-based subspace clusters. Our EDSC (efficient density-based subspace clustering) algorithm reduces the high computational cost of density-based subspace clustering through a complete multistep filter-and-refine scheme. The first filter step avoids exhaustive search of all regions in all subspaces by enclosing potential density-based clusters in hypercubes. The second filter step provides additional pruning based on a density monotonicity property. The final refinement step then detects the exact, unbiased density-based subspace clustering result. As we prove that pruning is lossless in both filter steps, we guarantee completeness of the result.
In thorough experiments on synthetic and real world data sets, we demonstrate substantial efficiency gains. Our lossless EDSC approach outperforms existing density-based subspace clustering algorithms by orders of magnitude.
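The multistep filter-and-refine idea described in the abstract can be illustrated in code. The sketch below is not the paper's actual EDSC algorithm: all names (`edsc_sketch`, `dense_cells`, `width`, `min_pts`) are illustrative assumptions, the hypercube filter is approximated by a CLIQUE-style fixed-width grid, and the exact density-based refinement step within the surviving candidates is omitted. It does show the two filter ideas: enclosing candidate clusters in hypercubes, and apriori-style pruning that is lossless under the grid model (a cell that is dense in subspace S projects onto dense cells in every subset of S).

```python
from itertools import combinations
from collections import defaultdict

def grid_cell(point, dims, width):
    """Map a point to its hypercube (grid cell) in the given subspace."""
    return tuple(int(point[d] // width) for d in dims)

def dense_cells(data, dims, width, min_pts):
    """Filter step 1 (illustrative): enclose candidate clusters in hypercubes.
    A cell is a candidate if it contains at least min_pts points."""
    cells = defaultdict(list)
    for i, p in enumerate(data):
        cells[grid_cell(p, dims, width)].append(i)
    return {c: idx for c, idx in cells.items() if len(idx) >= min_pts}

def edsc_sketch(data, width=1.0, min_pts=3, max_dim=None):
    """Bottom-up filter-and-refine sketch (not the paper's algorithm).
    Filter step 2 exploits density monotonicity: a subspace can only
    contain a dense hypercube if every lower-dimensional projection
    of that subspace does too, so non-surviving subsets prune losslessly."""
    n_dims = len(data[0])
    max_dim = max_dim or n_dims
    candidates, results = {}, {}
    # 1-dimensional candidate subspaces
    for d in range(n_dims):
        dense = dense_cells(data, (d,), width, min_pts)
        if dense:
            candidates[(d,)] = dense
            results[(d,)] = list(dense.values())
    # extend only subspaces whose lower-dimensional subsets survived
    current = list(candidates)
    for k in range(2, max_dim + 1):
        next_level, seen = [], set()
        for a, b in combinations(current, 2):
            dims = tuple(sorted(set(a) | set(b)))
            if len(dims) != k or dims in seen:
                continue
            seen.add(dims)
            # lossless pruning: all (k-1)-subsets must hold dense cells
            if any(tuple(s) not in candidates
                   for s in combinations(dims, k - 1)):
                continue
            dense = dense_cells(data, dims, width, min_pts)
            if dense:
                candidates[dims] = dense
                results[dims] = list(dense.values())
                next_level.append(dims)
        current = next_level
    return results
```

In a full implementation, each surviving candidate hypercube would be handed to an exact density-based clustering pass (the refinement step), so that the grid approximation never affects the final result.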
Index Terms
- EDSC: efficient density-based subspace clustering