Clustering for metric and nonmetric distance measures

Authors: Marcel R. Ackermann, Johannes Blömer, Christian Sohler

ACM Transactions on Algorithms (TALG), Volume 6, Issue 4

Article No.: 59, Pages 1 - 26

https://doi.org/10.1145/1824777.1824779

Published: 03 September 2010

Abstract

We study a generalization of the k-median problem with respect to an arbitrary dissimilarity measure D. Given a finite set P of size n, our goal is to find a set C of size k such that the sum of errors D(P,C) = ∑_{p ∈ P} min_{c ∈ C} D(p,c) is minimized. The main result in this article can be stated as follows: There exists a (1+ϵ)-approximation algorithm for the k-median problem with respect to D, if the 1-median problem can be approximated within a factor of (1+ϵ) by taking a random sample of constant size and solving the 1-median problem on the sample exactly. This algorithm requires time n·2^{O(mk log(mk/ϵ))}, where m is a constant that depends only on ϵ and D. Using this characterization, we obtain the first linear time (1+ϵ)-approximation algorithms for the k-median problem in an arbitrary metric space with bounded doubling dimension, for the Kullback-Leibler divergence (relative entropy), for the Itakura-Saito divergence, for Mahalanobis distances, and for some special cases of Bregman divergences. Moreover, we obtain previously known results for the Euclidean k-median problem and the Euclidean k-means problem in a simplified manner. Our results are based on a new analysis of an algorithm of Kumar et al. [2004].
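The sampling condition in the abstract can be illustrated with a small Python sketch. It is not the algorithm of the article. It only evaluates the sum of errors D(P,C) for the Kullback-Leibler divergence and compares the exact 1-median with the 1-median of a random sample. It relies on the known property of Bregman divergences that the mean of a point set is its optimal 1-median (see Banerjee et al. 2005 in the reference list). The sample size of 20, the seed, the data generator `random_distribution`, and all function names are assumptions made for illustration.

```python
import math
import random


def kl_divergence(p, c):
    """Kullback-Leibler divergence sum_i p_i * log(p_i / c_i) for positive c."""
    return sum(pi * math.log(pi / ci) for pi, ci in zip(p, c) if pi > 0)


def clustering_cost(points, centers, dissimilarity):
    """Sum of errors D(P, C): every point pays for its cheapest center."""
    return sum(min(dissimilarity(p, c) for c in centers) for p in points)


def mean(points):
    """Coordinate-wise arithmetic mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def sampled_one_median(points, sample_size, rng):
    """1-median of a uniform random sample (drawn with replacement).

    Assumption: the dissimilarity is a Bregman divergence, so the exact
    1-median of the sample is simply its mean.
    """
    sample = [rng.choice(points) for _ in range(sample_size)]
    return mean(sample)


def random_distribution(dim, rng):
    """Hypothetical test data: a strictly positive probability vector."""
    weights = [rng.random() + 0.1 for _ in range(dim)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)  # fixed seed, chosen only for reproducibility
P = [random_distribution(4, rng) for _ in range(200)]

optimal_cost = clustering_cost(P, [mean(P)], kl_divergence)
sampled_cost = clustering_cost(P, [sampled_one_median(P, 20, rng)], kl_divergence)
print(f"optimal 1-median cost: {optimal_cost:.4f}")
print(f"sampled 1-median cost: {sampled_cost:.4f}")
```

The sampled cost can never fall below the optimal cost. How close the two are for a given sample size is exactly the property the abstract requires of D.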

References

[1]

Ackermann, M. R. and Blömer, J. 2009. Coresets and approximate clustering for Bregman divergences. In Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'09). SIAM, 1088--1097.

[2]

Ackermann, M. R., Blömer, J., and Sohler, C. 2008. Clustering for metric and non-metric distance measures. In Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'08). SIAM, 799--808.

[3]

Arthur, D. and Vassilvitskii, S. 2007. k-means++: The advantages of careful seeding. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'07). SIAM, 1027--1035.

[4]

Bădoiu, M., Har-Peled, S., and Indyk, P. 2002. Approximate clustering via core-sets. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC'02). ACM, 250--257.

[5]

Bajaj, C. L. 1988. The algebraic degree of geometric optimization problems. Discr. Comput. Geom. 3, 1, 177--191.

[6]

Baker, L. D. and McCallum, A. K. 1998. Distributional clustering of words for text classification. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'98). ACM, 96--103.

[7]

Banerjee, A., Merugu, S., Dhillon, I. S., and Ghosh, J. 2005. Clustering with Bregman divergences. J. Mach. Learn. Res. 6, 1705--1749.

[8]

Bregman, L. M. 1967. The relaxation method of finding the common points of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 7, 200--217.

[9]

Buzo, A., Gray, Jr., A., Gray, R. M., and Markel, J. D. 1980. Speech coding based upon vector quantization. IEEE Trans. Acoust. Speech Signal Process. 28, 5, 562--574.

[10]

Censor, Y. and Zenios, S. A. 1997. Parallel Optimization: Theory, Algorithms, and Applications. Numerical Mathematics and Scientific Computation. Oxford University Press, UK.

[11]

Chaudhuri, K. and McGregor, A. 2008. Finding metric structure in information theoretic clustering. In Proceedings of the 21st Annual Conference on Learning Theory (COLT'08). Omnipress, 391--402.

[12]

Chen, K. 2006. On k-median clustering in high dimensions. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'06). SIAM, 1177--1185.

[13]

Chen, K. 2009. On coresets for k-median and k-means clustering in metric and Euclidean spaces and their applications. SIAM J. Comput. 39, 3, 923--947.

[14]

Cover, T. M. and Thomas, J. A. 2006. Elements of Information Theory 2nd Ed. Wiley-Interscience, Hoboken, New York.

[15]

Dhillon, I. S., Mallela, S., and Kumar, R. 2003. A divisive information-theoretic feature clustering algorithm for text classification. J. Mach. Learn. Res. 3, 1265--1287.

[16]

Feldman, D., Monemizadeh, M., and Sohler, C. 2007. A PTAS for k-means clustering based on weak coresets. In Proceedings of the 23rd ACM Symposium on Computational Geometry (SCG'07). ACM, 11--18.

[17]

Fernandez de la Vega, W., Karpinski, M., Kenyon, C., and Rabani, Y. 2003. Approximation schemes for clustering problems. In Proceedings of the 35th Annual ACM Symposium on Theory of Computing (STOC'03). ACM, 50--58.

[18]

Gupta, A., Krauthgamer, R., and Lee, J. R. 2003. Bounded geometries, fractals and low-distortion embeddings. In Proceedings of the 44th Symposium on Foundations of Computer Science (FOCS'03). IEEE Computer Society, 534--543.

[19]

Har-Peled, S. and Mazumdar, S. 2004. On coresets for k-means and k-median clustering. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC'04). ACM, 291--300.

[20]

Inaba, M., Katoh, N., and Imai, H. 1994. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering. In Proceedings of the 10th ACM Symposium on Computational Geometry (SCG'94). ACM, 332--339.

[21]

Itakura, F. and Saito, S. 1968. Analysis synthesis telephony based on the maximum likelihood method. In Reports of the 6th International Congress on Acoustics. Elsevier, 17--20.

[22]

Jain, A. K., Murty, M. N., and Flynn, P. J. 1999. Data clustering: A review. ACM Comput. Surv. 31, 3, 264--323.

[23]

Jain, K., Mahdian, M., and Saberi, A. 2002. A new greedy approach for facility location problems. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC'02). ACM, 731--740.

[24]

Kolliopoulos, S. G. and Rao, S. 1999. A nearly linear-time approximation scheme for the Euclidean k-median problem. In Proceedings of the 7th Annual European Symposium on Algorithms (ESA'99). Springer, 378--389.

[25]

Kullback, S. and Leibler, R. A. 1951. On information and sufficiency. Ann. Math. Statis. 22, 1, 79--86.

[26]

Kumar, A., Sabharwal, Y., and Sen, S. 2004. A simple linear time (1+ϵ)-approximation algorithm for k-means clustering in any dimensions. In Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science (FOCS'04). IEEE Computer Society, 454--462.

[27]

Kumar, A., Sabharwal, Y., and Sen, S. 2005. Linear time algorithms for clustering problems in any dimensions. In Proceedings of the 32nd International Colloquium on Automata, Languages and Programming (ICALP'05). Springer, 1374--1385.

[28]

Lloyd, S. P. 1982. Least squares quantization in PCM. IEEE Trans. Inform. Theory 28, 2, 129--137.

[29]

Mahalanobis, P. C. 1936. On the generalized distance in statistics. Proc. National Inst. Sci. India, Vol. 2, 1, 49--55.

[30]

Matoušek, J. 2000. On approximate geometric k-clustering. Discr. Comput. Geom. 24, 1, 61--84.

[31]

Mercer, D. P. 2003. Clustering large datasets. Tech. rep., Linacre College.

[32]

Mettu, R. R. and Plaxton, C. G. 2004. Optimal time bounds for approximate clustering. Mach. Learn. 56, 1--3, 35--60.

[33]

Nock, R., Luosto, P., and Kivinen, J. 2008. Mixed Bregman clustering with approximation guarantees. In Proceedings of the European Conference on Machine Learning (ECML'08). Springer, 154--169.

[34]

Ostrovsky, R., Rabani, Y., Schulman, L. J., and Swamy, C. 2006. The effectiveness of Lloyd-type methods for the k-means problem. In Proceedings of the 47th Annual Symposium on Foundations of Computer Science (FOCS'06). IEEE Computer Society, 165--176.

[35]

Pereira, F. C. N., Tishby, N., and Lee, L. 1993. Distributional clustering of English words. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (ACL'93). ACL, 183--190.

[36]

Slonim, N. and Tishby, N. 1999. Agglomerative information bottleneck. In Advances in Neural Information Processing Systems 12 (NIPS 12). The MIT Press, 617--623.

[37]

Sra, S., Jegelka, S., and Banerjee, A. 2008. Approximation algorithms for Bregman clustering, co-clustering and tensor clustering. Tech. rep. MPIK-TR-177.

[38]

Thorup, M. 2005. Quick k-median, k-center, and facility location for sparse graphs. SIAM J. Comput. 34, 2, 405--432.

[39]

Xu, R. and Wunsch, II, D. C. 2005. Survey of clustering algorithms. IEEE Trans. Neural Netw. 16, 3, 645--678.

Reviews

Reviewer: Aris Gkoulalas-Divanis

The k-median problem can be simply stated as follows: Given a set of locations and a maximum number of facilities, decide at which locations facilities should be placed in order to minimize the total cost, computed as the average distance of each location to its nearest facility. The k-median problem has been proved to be nondeterministic polynomial-time (NP) hard in arbitrary metric spaces; for this reason, several polynomial-time approximation algorithms have been developed that offer different guarantees with respect to the level of approximation. The authors study a generalization of the k-median problem where, given a finite set of objects P derived from a ground set D, the goal is to find a (smaller) set C of k objects from D, such that the sum of errors Dist(P, C) between P and C is minimized. The elements of C are the k-medians of P. No assumptions are made by the authors about the employed dissimilarity measure, other than Dist(x, y) = 0 if and only if x = y. As the authors prove, the k-median problem can be solved by a (1+ϵ)-approximation algorithm when the 1-median problem can be approximated within a factor of (1+ϵ) by taking a random sample of constant size and solving the 1-median problem on the sample exactly. Using this characterization, [the authors] obtain the first linear time (1+ϵ)-approximation algorithms for the k-median problem in an arbitrary metric space with bounded doubling dimension, for the Kullback-Leibler divergence, [for the Itakura-Saito divergence, for Mahalanobis distances, and for special cases of Bregman divergences]. This highly technical paper is interesting and makes an important contribution. I recommend it to those who conduct research in this area.

Online Computing Reviews Service
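To make the objective in the review concrete, the following Python sketch finds k centers by exhaustive search on a tiny one-dimensional instance. It is a brute-force illustration under stated assumptions, not the approximation algorithm of the article: candidate centers are restricted to the input points, the dissimilarity is the one-dimensional Itakura-Saito divergence, and the names `itakura_saito`, `sum_of_errors`, and `brute_force_k_median` are hypothetical.

```python
import math
from itertools import combinations


def itakura_saito(x, y):
    """One-dimensional Itakura-Saito divergence for positive reals x and y."""
    ratio = x / y
    return ratio - math.log(ratio) - 1.0


def sum_of_errors(points, centers, dissimilarity):
    """Dist(P, C): each point is charged the cost of its cheapest center."""
    return sum(min(dissimilarity(p, c) for c in centers) for p in points)


def brute_force_k_median(points, candidates, k, dissimilarity):
    """Try every k-subset of the candidates and keep the cheapest one.

    The number of subsets grows exponentially in k, so this is only
    suitable for very small inputs.
    """
    return min(
        combinations(candidates, k),
        key=lambda centers: sum_of_errors(points, centers, dissimilarity),
    )


# Two well-separated groups of positive numbers (illustrative data only).
P = [1.0, 1.1, 10.0, 11.0]
best = brute_force_k_median(P, P, 2, itakura_saito)
print("chosen centers:", best)
print("sum of errors:", sum_of_errors(P, best, itakura_saito))
```

The Itakura-Saito divergence is zero exactly when both arguments are equal, which is the only property the review says is assumed, and it is not symmetric, so it is a nonmetric dissimilarity measure.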

Information

Published In

ACM Transactions on Algorithms Volume 6, Issue 4

August 2010

308 pages

ISSN:1549-6325

EISSN:1549-6333

DOI:10.1145/1824777

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 September 2010

Accepted: 01 August 2009

Received: 01 July 2009

Published in TALG Volume 6, Issue 4

Qualifiers

Research-article
Research
Refereed

Funding Sources

Deutsche Forschungsgemeinschaft

Article Metrics

Total Citations: 65
Total Downloads: 1,207
Downloads (Last 12 months): 71
Downloads (Last 6 weeks): 5

Reflects downloads up to 17 Jan 2025

Cited By

Vasconcelos D, Nunes N, Förster A, Gomes J (2024) Optimal 2D audio features estimation for a lightweight application in mosquitoes species. Computers in Biology and Medicine 168:C. DOI: 10.1016/j.compbiomed.2023.107787. Online publication date: 12-Apr-2024
https://dl.acm.org/doi/10.1016/j.compbiomed.2023.107787
Feng Q, Fu B (2024) Speeding Up Constrained k-Means Through 2-Means. Algorithmic Aspects in Information and Management (52-63). DOI: 10.1007/978-981-97-7801-0_5. Online publication date: 19-Sep-2024
https://doi.org/10.1007/978-981-97-7801-0_5
Buchin M, Driemel A, Rohde D (2023) Approximating (k,ℓ)-Median Clustering for Polygonal Curves. ACM Transactions on Algorithms 19:1 (1-32). DOI: 10.1145/3559764. Online publication date: 23-Feb-2023
https://dl.acm.org/doi/10.1145/3559764
Duroux D, Van Steen K (2023) netANOVA: novel graph clustering technique with significance assessment via hierarchical ANOVA. Briefings in Bioinformatics 24:2. DOI: 10.1093/bib/bbad029. Online publication date: 4-Feb-2023
https://doi.org/10.1093/bib/bbad029
Cho K, Oh E (2023) Linear-time approximation scheme for k-means clustering of axis-parallel affine subspaces. Computational Geometry 112 (101981). DOI: 10.1016/j.comgeo.2023.101981. Online publication date: Jun-2023
https://doi.org/10.1016/j.comgeo.2023.101981
Pellizzoni P, Pietracaprina A, Pucci G (2022) Adaptive k-center and diameter estimation in sliding windows. International Journal of Data Science and Analytics 14:2 (155-173). DOI: 10.1007/s41060-022-00318-z. Online publication date: 2-Apr-2022
https://doi.org/10.1007/s41060-022-00318-z
Mi L, Sheikholeslami A, Bento J (2022) A family of pairwise multi-marginal optimal transports that define a generalized metric. Machine Language 112:1 (353-384). DOI: 10.1007/s10994-022-06280-y. Online publication date: 20-Dec-2022
https://dl.acm.org/doi/10.1007/s10994-022-06280-y
Tian X, Xu D, Guo L, Wu D (2022) Improved local search algorithms for Bregman k-means and its variants. Journal of Combinatorial Optimization 44:4 (2533-2550). DOI: 10.1007/s10878-021-00771-9. Online publication date: 1-Nov-2022
https://dl.acm.org/doi/10.1007/s10878-021-00771-9
Buchin M, Driemel A, Rohde D, Marx D (2021) Approximating (k, ℓ)-median clustering for polygonal curves. Proceedings of the Thirty-Second Annual ACM-SIAM Symposium on Discrete Algorithms (2697-2716). DOI: 10.5555/3458064.3458224. Online publication date: 10-Jan-2021
https://dl.acm.org/doi/10.5555/3458064.3458224
Ali Z, Mahmood T, Ullah K (2021) Picture Hesitant Fuzzy Clustering Based on Generalized Picture Hesitant Fuzzy Distance Measures. Knowledge 1:1 (40-51). DOI: 10.3390/knowledge1010005. Online publication date: 14-Oct-2021
https://doi.org/10.3390/knowledge1010005