Locally adaptive metrics for clustering high dimensional data

Data Mining and Knowledge Discovery

Abstract

Clustering suffers from the curse of dimensionality, and similarity functions that treat all input features as equally relevant may not be effective. We introduce an algorithm that discovers clusters in subspaces spanned by different combinations of dimensions via local weightings of features. This approach avoids the risk of information loss encountered in global dimensionality reduction techniques, and does not assume any data distribution model. Our method associates with each cluster a weight vector whose values capture the relevance of features within that cluster. We experimentally demonstrate the gain in performance our method achieves over competing methods, using both synthetic and real datasets. In particular, our results show the feasibility of the proposed technique for simultaneous clustering of genes and conditions in gene expression data, and for clustering very high-dimensional data such as text.
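The idea of a per-cluster weight vector can be illustrated with a minimal sketch. The abstract does not state the update rule, so everything below is an assumption made for illustration: the exponential weighting of per-dimension spread, the bandwidth parameter `h`, and the function names `local_feature_weights` and `assign` are hypothetical and are not taken from the article.

```python
import numpy as np


def local_feature_weights(points, labels, centroids, h=1.0):
    """Return one weight vector per cluster; each row sums to 1.

    Assumed rule (not from the abstract): a dimension along which a
    cluster is tight gets a large weight, via exp(-spread / h).
    """
    k, d = centroids.shape
    weights = np.full((k, d), 1.0 / d)  # uniform fallback for empty clusters
    for j in range(k):
        members = points[labels == j]
        if len(members) == 0:
            continue
        # Average squared deviation from the centroid along each dimension.
        spread = ((members - centroids[j]) ** 2).mean(axis=0)
        # Subtracting the minimum only improves numerical stability;
        # it cancels out after normalization.
        scores = np.exp(-(spread - spread.min()) / h)
        weights[j] = scores / scores.sum()
    return weights


def assign(points, centroids, weights):
    """Assign each point to the centroid with the smallest weighted squared distance."""
    diffs = points[:, None, :] - centroids[None, :, :]        # shape (n, k, d)
    dists = (weights[None, :, :] * diffs ** 2).sum(axis=2)    # shape (n, k)
    return dists.argmin(axis=1)


# Toy data: cluster 0 is tight along dimension 0, cluster 1 along dimension 1.
points = np.array([
    [0.0, -2.0], [0.1, 0.0], [-0.1, 2.0],
    [3.0, 5.0], [5.0, 5.1], [7.0, 4.9],
])
labels = np.array([0, 0, 0, 1, 1, 1])
centroids = np.array([points[labels == j].mean(axis=0) for j in range(2)])

weights = local_feature_weights(points, labels, centroids, h=1.0)
new_labels = assign(points, centroids, weights)
print(weights)
print(new_labels)
```

In this toy example each cluster places most of its weight on the dimension along which its points agree, so the two clusters are effectively compared in different subspaces. A full algorithm would alternate weight updates, assignments, and centroid updates until convergence; that loop is omitted here.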

Author information

Corresponding author

Correspondence to Carlotta Domeniconi.

Additional information

Editor: Johannes Gehrke.

About this article

Cite this article

Domeniconi, C., Gunopulos, D., Ma, S. et al. Locally adaptive metrics for clustering high dimensional data. Data Min Knowl Disc 14, 63–97 (2007). https://doi.org/10.1007/s10618-006-0060-8

  • DOI: https://doi.org/10.1007/s10618-006-0060-8
