
A survey on enhanced subspace clustering

Data Mining and Knowledge Discovery

Abstract

Subspace clustering finds sets of objects that are homogeneous in subspaces of high-dimensional datasets, and has been successfully applied in many domains. In recent years, a new breed of subspace clustering algorithms, which we call enhanced subspace clustering algorithms, has been proposed to (1) handle the increasing abundance and complexity of data and (2) improve the clustering results. In this survey, we present these enhanced approaches to subspace clustering by discussing the problems they solve, their cluster definitions, and their algorithms. In addition to enhanced subspace clustering, we also cover basic subspace clustering and related work in high-dimensional clustering.
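
To make the notion of "objects that are homogeneous in a subspace" concrete, the following is a minimal sketch under stated assumptions, not an algorithm taken from the survey. It assumes values scaled to [0, 1], treats "homogeneous" as "falling into the same grid cell" when the data is projected onto a subset of dimensions, and enumerates subspaces by brute force. The function names (`dense_cells`, `subspace_clusters`), the parameters (`bins`, `min_points`, `max_dims`), and the toy data are all hypothetical.

```python
from itertools import combinations


def dense_cells(data, dims, bins, min_points):
    """Return grid cells (in the projection onto `dims`) holding >= min_points objects."""
    cells = {}
    for idx, row in enumerate(data):
        # Assumes values in [0, 1]; the min() clamp puts 1.0 in the last bin.
        key = tuple(min(int(row[d] * bins), bins - 1) for d in dims)
        cells.setdefault(key, []).append(idx)
    return {key: members for key, members in cells.items()
            if len(members) >= min_points}


def subspace_clusters(data, bins=4, min_points=3, max_dims=2):
    """Brute-force search over all subspaces with at most `max_dims` dimensions."""
    n_dims = len(data[0])
    found = []
    for k in range(1, max_dims + 1):
        for dims in combinations(range(n_dims), k):
            for members in dense_cells(data, dims, bins, min_points).values():
                found.append((dims, members))
    return found


# Toy data: objects 0-2 agree on dimensions 0 and 1 but are spread out on dimension 2.
data = [
    [0.10, 0.80, 0.05],
    [0.12, 0.85, 0.55],
    [0.15, 0.78, 0.95],
    [0.60, 0.20, 0.30],
    [0.90, 0.40, 0.70],
]

clusters = subspace_clusters(data)
for dims, members in clusters:
    print(f"subspace {dims}: objects {members}")
```

In this toy data, objects 0 to 2 form a group in the subspace spanned by dimensions 0 and 1, yet no group appears in any subspace that includes dimension 2, which is the kind of structure a full-space clustering can miss. The brute-force enumeration grows exponentially with the number of dimensions, so it is only meant as an illustration.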



Author information


Corresponding author

Correspondence to Kelvin Sim.

Additional information

Responsible editor: Charu Aggarwal.


About this article

Cite this article

Sim, K., Gopalkrishnan, V., Zimek, A. et al. A survey on enhanced subspace clustering. Data Min Knowl Disc 26, 332–397 (2013). https://doi.org/10.1007/s10618-012-0258-x


  • DOI: https://doi.org/10.1007/s10618-012-0258-x
