Locally adaptive metrics for clustering high dimensional data

Domeniconi, Carlotta; Gunopulos, Dimitrios; Ma, Sheng; Yan, Bojun; Al-Razgan, Muna; Papadopoulos, Dimitris

doi:10.1007/s10618-006-0060-8

Published: 26 January 2007

Data Mining and Knowledge Discovery, volume 14, pages 63–97 (2007)

Abstract

Clustering suffers from the curse of dimensionality, and similarity functions that use all input features with equal relevance may not be effective. We introduce an algorithm that discovers clusters in subspaces spanned by different combinations of dimensions via local weightings of features. This approach avoids the risk of information loss encountered in global dimensionality reduction techniques, and does not assume any data distribution model. Our method associates with each cluster a weight vector whose values capture the relevance of features within that cluster. We experimentally demonstrate the gain in performance our method achieves over competing methods, using both synthetic and real datasets. In particular, our results show that the proposed technique can feasibly perform simultaneous clustering of genes and conditions in gene expression data, as well as clustering of very high-dimensional data such as text.
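The idea of a per-cluster weight vector can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not state the update rule, so the exponential weighting in `local_weights`, the bandwidth parameter `h`, and the function names are all assumptions made for illustration. The sketch alternates between assigning points by a weighted squared distance and re-estimating each cluster's centroid and feature weights.

```python
import math
import random


def weighted_sq_dist(x, center, weights):
    """Squared Euclidean distance with one weight per feature."""
    return sum(w * (a - c) ** 2 for w, a, c in zip(weights, x, center))


def local_weights(points, center, h=1.0):
    """Assumed update rule (not taken from the article): a feature along
    which the cluster is tight receives a larger weight. Weights sum to 1."""
    dim = len(center)
    if not points:
        return [1.0 / dim] * dim
    # Average squared deviation from the centroid along each feature.
    spread = [
        sum((p[j] - center[j]) ** 2 for p in points) / len(points)
        for j in range(dim)
    ]
    smallest = min(spread)  # shift before exponentiating for stability
    raw = [math.exp(-(s - smallest) / h) for s in spread]
    total = sum(raw)
    return [r / total for r in raw]


def cluster(data, k, h=1.0, iters=20, seed=0):
    """k-means-style loop in which every cluster keeps its own weights."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(data, k)]
    dim = len(data[0])
    weights = [[1.0 / dim] * dim for _ in range(k)]
    labels = [0] * len(data)
    for _ in range(iters):
        # Assign each point to the cluster with the smallest weighted distance.
        labels = [
            min(range(k),
                key=lambda c: weighted_sq_dist(x, centers[c], weights[c]))
            for x in data
        ]
        for c in range(k):
            members = [x for x, lab in zip(data, labels) if lab == c]
            if members:  # an empty cluster keeps its previous state
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
                weights[c] = local_weights(members, centers[c], h)
    return labels, centers, weights
```

Calling `cluster(data, k)` on a list of equal-length numeric tuples returns the labels, the centroids, and one weight vector per cluster. Inspecting a cluster's weight vector shows which features the sketch treats as relevant for that cluster, which is the role the abstract assigns to the weights.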

Author information

Authors and Affiliations

George Mason University, Fairfax, VA, USA
Carlotta Domeniconi, Bojun Yan & Muna Al-Razgan
UC Riverside, Riverside, CA, USA
Dimitrios Gunopulos & Dimitris Papadopoulos
Vivido Media Inc., Suite 319, Digital Media Tower 7 Information Rd., Shangdi Development Zone, Beijing, 100085, China
Sheng Ma

Corresponding author

Correspondence to Carlotta Domeniconi.

Additional information

Editor: Johannes Gehrke.

About this article

Cite this article

Domeniconi, C., Gunopulos, D., Ma, S. et al. Locally adaptive metrics for clustering high dimensional data. Data Min Knowl Disc 14, 63–97 (2007). https://doi.org/10.1007/s10618-006-0060-8

Received: 11 January 2006
Accepted: 20 November 2006
Issue Date: February 2007
