DOI: 10.1145/3313276.3316350

Performance of Johnson-Lindenstrauss transform for k-means and k-medians clustering

Published: 23 June 2019

Abstract

Consider an instance of Euclidean k-means or k-medians clustering. We show that the cost of the optimal solution is preserved up to a factor of (1+ε) under a projection onto a random O(log(k/ε)/ε²)-dimensional subspace. Further, the cost of every clustering is preserved within (1+ε). More generally, our result applies to any dimension reduction map satisfying a mild sub-Gaussian-tail condition. Our bound on the dimension is nearly optimal. Additionally, our result applies to Euclidean k-clustering with the distances raised to the p-th power for any constant p.
For k-means, our result resolves an open problem posed by Cohen, Elder, Musco, Musco, and Persu (STOC 2015); for k-medians, it answers a question raised by Kannan.
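To make the statement concrete, here is a minimal sketch (not code from the paper, assuming numpy and scikit-learn are available): it draws a Gaussian Johnson-Lindenstrauss map into m = O(log(k/ε)/ε²) dimensions, clusters the projected points, and evaluates the same partition in the original space. The synthetic instance, the hidden constant (4 below), and the use of scikit-learn's KMeans are all illustrative assumptions.

    # Illustrative sketch of the paper's guarantee; the constant 4 and the
    # synthetic instance are assumptions, not prescriptions from the paper.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    n, d, k, eps = 2000, 500, 10, 0.5

    # Synthetic instance: k well-separated Gaussian blobs in R^d.
    centers = rng.normal(scale=10.0, size=(k, d))
    X = centers[rng.integers(k, size=n)] + rng.normal(size=(n, d))

    # Target dimension m = O(log(k/eps)/eps^2); the constant is a guess.
    m = int(np.ceil(4 * np.log(k / eps) / eps**2))  # m = 48 here

    # Gaussian JL map with entries N(0, 1/m): squared distances are
    # preserved in expectation, with sub-Gaussian concentration.
    G = rng.normal(scale=1.0 / np.sqrt(m), size=(d, m))
    Y = X @ G

    def kmeans_cost(points, labels):
        # k-means cost of a partition: total squared distance to cluster means.
        return sum(
            ((points[labels == c] - points[labels == c].mean(axis=0)) ** 2).sum()
            for c in np.unique(labels)
        )

    # Cluster in the projected space, then score the same partition in both
    # spaces; the theorem says the two costs agree up to a (1 + eps) factor.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Y)
    ratio = kmeans_cost(Y, labels) / kmeans_cost(X, labels)
    print(f"m = {m}, projected/original cost ratio = {ratio:.3f}")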

References

[1]
{AC06} Nir Ailon and Bernard Chazelle. Approximate nearest neighbors and the fast Johnson–Lindenstrauss transform. In Proceedings of the Symposium on Theory of Computing, pages 557–563, 2006.
[2]
{Ach03} Dimitris Achlioptas. Database-friendly random projections: Johnson–Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687, 2003.
[3]
{AL09} Nir Ailon and Edo Liberty. Fast dimension reduction using Rademacher series on dual BCH codes. Discrete & Computational Geometry, 42(4):615, 2009.
[4]
{AL13} Nir Ailon and Edo Liberty. An almost optimal unrestricted fast Johnson–Lindenstrauss transform. ACM Transactions on Algorithms, 9(3):21, 2013.
[5]
{Alo03} Noga Alon. Problems and results in extremal combinatorics–I. Discrete Mathematics, 273(1–3):31–53, 2003.
[6]
{BBCA+19} Luca Becchetti, Marc Bury, Vincent Cohen-Addad, Fabrizio Grandoni, and Chris Schwiegelshohn. Oblivious dimension reduction for k-means: beyond subspaces and the Johnson–Lindenstrauss lemma. In Proceedings of the Symposium on Theory of Computing, 2019.
[7]
{BDM09} Christos Boutsidis, Petros Drineas, and Michael W. Mahoney. Unsupervised feature selection for the k-means clustering problem. In Advances in Neural Information Processing Systems, pages 153–161, 2009.
[8]
{BMI13} Christos Boutsidis and Malik Magdon-Ismail. Deterministic feature selection for k-means clustering. IEEE Transactions on Information Theory, 59(9):6099–6110, 2013.
[9]
{BZD10} Christos Boutsidis, Anastasios Zouzias, and Petros Drineas. Random projections for k-means clustering. In Advances in Neural Information Processing Systems, pages 298–306, 2010.
[10]
{BZMD15} Christos Boutsidis, Anastasios Zouzias, Michael W. Mahoney, and Petros Drineas. Randomized dimensionality reduction for k-means clustering. IEEE Transactions on Information Theory, 61(2):1045–1062, 2015.
[11]
{CEM+15} Michael B. Cohen, Sam Elder, Cameron Musco, Christopher Musco, and Madalina Persu. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the Symposium on Theory of Computing, pages 163–172, 2015.
[12]
{DFK+99} Petros Drineas, Alan M. Frieze, Ravi Kannan, Santosh Vempala, and V. Vinay. Clustering in large graphs and matrices. In Proceedings of the Symposium on Discrete Algorithms, pages 291–299, 1999.
[13]
{DG03} Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, 22(1):60–65, 2003.
[14]
{DKS10} Anirban Dasgupta, Ravi Kumar, and Tamás Sarlós. A sparse Johnson–Lindenstrauss transform. In Proceedings of the Symposium on Theory of Computing, pages 341–350, 2010.
[15]
{Far90} Nariman Farvardin. A study of vector quantization for noisy channels. IEEE Transactions on Information Theory, 36(4):799–809, 1990.
[16]
{FSS13} Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data: Constant-size coresets for k-means, PCA, and projective clustering. In Proceedings of the Symposium on Discrete Algorithms, pages 1434–1453, 2013.
[17]
{IM98} Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Symposium on Theory of Computing, pages 604–613, 1998.
[18]
{Jai10} Anil K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010.
[19]
{JDS11} Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128, 2011.
[20]
{JL84} William Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Conference in Modern Analysis and Probability (New Haven, Connecticut, 1982), volume 26 of Contemporary Mathematics, pages 189–206, 1984.
[21]
{Kan18} Ravi Kannan. Intro and foundations of data science I. Tutorial at the Simons Institute, 2018. Available at https://www.youtube.com/watch?v=9GMT3FnQTGM.
[22]
{Kir34} M. Kirszbraun. Über die zusammenziehende und lipschitzsche Transformationen. Fundamenta Mathematicae, 22(1):77–108, 1934.
[23]
{KM05} B. Klartag and Shahar Mendelson. Empirical processes and random projections. Journal of Functional Analysis, 225(1):229–245, 2005.
[24]
{KN14} Daniel M. Kane and Jelani Nelson. Sparser Johnson–Lindenstrauss transforms. Journal of the ACM, 61(1):4, 2014.
[25]
{KW11} Felix Krahmer and Rachel Ward. New and improved Johnson–Lindenstrauss embeddings via the restricted isometry property. SIAM Journal on Mathematical Analysis, 43(3):1269–1281, 2011.
[26]
{Llo82} Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
[27]
{LM00} Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302–1338, 2000.
[28]
{LN17} Kasper Green Larsen and Jelani Nelson. Optimality of the Johnson–Lindenstrauss lemma. In Proceedings of the Symposium on Foundations of Computer Science, pages 633–638, 2017.
[29]
{Nao18} Assaf Naor. Metric dimension reduction: A snapshot of the Ribe program. arXiv preprint arXiv:1809.02376, 2018.
[30]
{NPW14} Jelani Nelson, Eric Price, and Mary Wootters. New constructions of RIP matrices with fast multiplication and fewer rows. In Proceedings of the Symposium on Discrete Algorithms, pages 1515–1528, 2014.
[31]
{Sar06} Tamás Sarlós. Improved approximation algorithms for large matrices via random projections. In Proceedings of the Symposium on Foundations of Computer Science, pages 143–152, 2006.
[32]
{SW18} Christian Sohler and David P. Woodruff. Strong coresets for k-median and subspace approximation: Goodbye dimension. In Proceedings of the Symposium on Foundations of Computer Science, pages 802–813, 2018.




Published In

STOC 2019: Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing
June 2019
1258 pages
ISBN: 9781450367059
DOI: 10.1145/3313276

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Johnson-Lindenstrauss transform
  2. Kirszbraun theorem
  3. clustering
  4. dimension reduction
  5. k-means
  6. k-medians

Qualifiers

  • Research-article

Conference

STOC '19

Acceptance Rates

Overall Acceptance Rate 1,469 of 4,586 submissions, 32%


Cited By

  • (2024) Coresets for multiple ℓp regression. Proceedings of the 41st International Conference on Machine Learning, pages 53202–53233. DOI: 10.5555/3692070.3694250. Online publication date: 21-Jul-2024.
  • (2024) Making old things new. Proceedings of the 41st International Conference on Machine Learning, pages 12046–12086. DOI: 10.5555/3692070.3692550. Online publication date: 21-Jul-2024.
  • (2024) Fair Projections as a Means toward Balanced Recommendations. ACM Transactions on Intelligent Systems and Technology, 16(1):1–32. DOI: 10.1145/3664929. Online publication date: 30-Dec-2024.
  • (2024) Settling Time vs. Accuracy Tradeoffs for Clustering Big Data. Proceedings of the ACM on Management of Data, 2(3):1–25. DOI: 10.1145/3654976. Online publication date: 30-May-2024.
  • (2024) TS-RTPM-Net: Data-Driven Tensor Sketching for Efficient CP Decomposition. IEEE Transactions on Big Data, 10(1):1–11. DOI: 10.1109/TBDATA.2023.3310254. Online publication date: Feb-2024.
  • (2024) Random Projections for Curves in High Dimensions. Discrete & Computational Geometry. DOI: 10.1007/s00454-024-00710-5. Online publication date: 11-Dec-2024.
  • (2023) Dimension reduction for maximum matchings and the Fastest Mixing Markov Chain. Comptes Rendus. Mathématique, 361(G5):869–876. DOI: 10.5802/crmath.447. Online publication date: 18-Jul-2023.
  • (2023) k-median clustering via metric embedding. Proceedings of the 37th International Conference on Neural Information Processing Systems, pages 73817–73838. DOI: 10.5555/3666122.3669350. Online publication date: 10-Dec-2023.
  • (2023) On generalization bounds for projective clustering. Proceedings of the 37th International Conference on Neural Information Processing Systems, pages 71723–71754. DOI: 10.5555/3666122.3669262. Online publication date: 10-Dec-2023.
  • (2023) Sketching algorithms for sparse dictionary learning. Proceedings of the 37th International Conference on Neural Information Processing Systems, pages 48431–48443. DOI: 10.5555/3666122.3668223. Online publication date: 10-Dec-2023.
