research-article

Clustering for metric and nonmetric distance measures

Published: 03 September 2010

Abstract

We study a generalization of the k-median problem with respect to an arbitrary dissimilarity measure D. Given a finite set P of size n, our goal is to find a set C of size k such that the sum of errors $D(P,C) = \sum_{p \in P} \min_{c \in C} D(p,c)$ is minimized. The main result in this article can be stated as follows: There exists a (1+ϵ)-approximation algorithm for the k-median problem with respect to D, if the 1-median problem can be approximated within a factor of (1+ϵ) by taking a random sample of constant size and solving the 1-median problem on the sample exactly. This algorithm requires time $n 2^{O(mk \log(mk/\epsilon))}$, where m is a constant that depends only on ϵ and D. Using this characterization, we obtain the first linear time (1+ϵ)-approximation algorithms for the k-median problem in an arbitrary metric space with bounded doubling dimension, for the Kullback-Leibler divergence (relative entropy), for the Itakura-Saito divergence, for Mahalanobis distances, and for some special cases of Bregman divergences. Moreover, we obtain previously known results for the Euclidean k-median problem and the Euclidean k-means problem in a simplified manner. Our results are based on a new analysis of an algorithm of Kumar et al. [2004].
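To make the objective and the sampling condition concrete, the following is a minimal Python sketch. It is an illustration only, not the algorithm analyzed in the article. All function names are hypothetical. It assumes the dissimilarity is a Bregman divergence (squared Euclidean distance and the Kullback-Leibler divergence are shown), for which the exact 1-median of a point set is known to be its centroid, so "solving the 1-median problem on the sample exactly" reduces to averaging the sample.

```python
import math
import random


def squared_euclidean(p, c):
    """Squared Euclidean distance (the k-means dissimilarity)."""
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))


def kl_divergence(p, c):
    """KL(p || c) for strictly positive probability vectors p and c."""
    return sum(pi * math.log(pi / ci) for pi, ci in zip(p, c))


def clustering_cost(points, centers, dist):
    """D(P, C): each point pays the dissimilarity to its closest center."""
    return sum(min(dist(p, c) for c in centers) for p in points)


def centroid(points):
    """Coordinate-wise mean; the exact 1-median for Bregman divergences."""
    n = len(points)
    dim = len(points[0])
    return tuple(sum(p[j] for p in points) / n for j in range(dim))


def sampled_one_median(points, sample_size, rng):
    """Approximate 1-median: exact optimum of a uniform random sample.

    Assumption: the dissimilarity is a Bregman divergence, so the
    optimum on the sample is simply the sample centroid.
    """
    sample = [rng.choice(points) for _ in range(sample_size)]
    return centroid(sample)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical toy data: 200 points in the plane.
    data = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(200)]
    exact = clustering_cost(data, [centroid(data)], squared_euclidean)
    approx_center = sampled_one_median(data, 20, rng)
    approx = clustering_cost(data, [approx_center], squared_euclidean)
    print("optimal 1-median cost:", exact)
    print("sampled 1-median cost:", approx)
```

The sample size of 20 in the usage example is arbitrary. The abstract only states that a constant sample size (depending on ϵ and D) suffices when the sampling condition holds; it does not give a specific value.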

References

[1]
Ackermann, M. R. and Blömer, J. 2009. Coresets and approximate clustering for Bregman divergences. In Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'09). SIAM, 1088--1097.
[2]
Ackermann, M. R., Blömer, J., and Sohler, C. 2008. Clustering for metric and non-metric distance measures. In Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'08). SIAM, 799--808.
[3]
Arthur, D. and Vassilvitskii, S. 2007. k-means++: The advantages of careful seeding. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'07). SIAM, 1027--1035.
[4]
Bădoiu, M., Har-Peled, S., and Indyk, P. 2002. Approximate clustering via core-sets. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC'02). ACM, 250--257.
[5]
Bajaj, C. L. 1988. The algebraic degree of geometric optimization problems. Discr. Comput. Geom. 3, 1, 177--191.
[6]
Baker, L. D. and McCallum, A. K. 1998. Distributional clustering of words for text classification. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'98). ACM, 96--103.
[7]
Banerjee, A., Merugu, S., Dhillon, I. S., and Ghosh, J. 2005. Clustering with Bregman divergences. J. Mach. Learn. Res. 6, 1705--1749.
[8]
Bregman, L. M. 1967. The relaxation method of finding the common points of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 7, 200--217.
[9]
Buzo, A., Gray, Jr., A., Gray, R. M., and Markel, J. D. 1980. Speech coding based upon vector quantization. IEEE Trans. Acoust. Speech Signal Process. 28, 5, 562--574.
[10]
Censor, Y. and Zenios, S. A. 1997. Parallel Optimization: Theory, Algorithms, and Applications. Numerical Mathematics and Scientific Computation. Oxford University Press, UK.
[11]
Chaudhuri, K. and McGregor, A. 2008. Finding metric structure in information theoretic clustering. In Proceedings of the 21st Annual Conference on Learning Theory (COLT'08). Omnipress, 391--402.
[12]
Chen, K. 2006. On k-median clustering in high dimensions. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'06). SIAM, 1177--1185.
[13]
Chen, K. 2009. On coresets for k-median and k-means clustering in metric and Euclidean spaces and their applications. SIAM J. Comput. 39, 3, 923--947.
[14]
Cover, T. M. and Thomas, J. A. 2006. Elements of Information Theory 2nd Ed. Wiley-Interscience, Hoboken, New York.
[15]
Dhillon, I. S., Mallela, S., and Kumar, R. 2003. A divisive information-theoretic feature clustering algorithm for text classification. J. Mach. Learn. Res. 3, 1265--1287.
[16]
Feldman, D., Monemizadeh, M., and Sohler, C. 2007. A PTAS for k-means clustering based on weak coresets. In Proceedings of the 23rd ACM Symposium on Computational Geometry (SCG'07). ACM, 11--18.
[17]
Fernandez de la Vega, W., Karpinski, M., Kenyon, C., and Rabani, Y. 2003. Approximation schemes for clustering problems. In Proceedings of the 35th Annual ACM Symposium on Theory of Computing (STOC'03). ACM, 50--58.
[18]
Gupta, A., Krauthgamer, R., and Lee, J. R. 2003. Bounded geometries, fractals and low-distortion embeddings. In Proceedings of the 44th Symposium on Foundations of Computer Science (FOCS'03). IEEE Computer Society, 534--543.
[19]
Har-Peled, S. and Mazumdar, S. 2004. On coresets for k-means and k-median clustering. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC'04). ACM, 291--300.
[20]
Inaba, M., Katoh, N., and Imai, H. 1994. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering. In Proceedings of the 10th ACM Symposium on Computational Geometry (SCG'94). ACM, 332--339.
[21]
Itakura, F. and Saito, S. 1968. Analysis synthesis telephony based on the maximum likelihood method. In Reports of the 6th International Congress on Acoustics. Elsevier, 17--20.
[22]
Jain, A. K., Murty, M. N., and Flynn, P. J. 1999. Data clustering: A review. ACM Comput. Surv. 31, 3, 264--323.
[23]
Jain, K., Mahdian, M., and Saberi, A. 2002. A new greedy approach for facility location problems. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC'02). ACM, 731--740.
[24]
Kolliopoulos, S. G. and Rao, S. 1999. A nearly linear-time approximation scheme for the Euclidean k-median problem. In Proceedings of the 7th Annual European Symposium on Algorithms (ESA'99). Springer, 378--389.
[25]
Kullback, S. and Leibler, R. A. 1951. On information and sufficiency. Ann. Math. Statis. 22, 1, 79--86.
[26]
Kumar, A., Sabharwal, Y., and Sen, S. 2004. A simple linear time (1+ε)-approximation algorithm for k-means clustering in any dimensions. In Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science (FOCS'04). IEEE Computer Society, 454--462.
[27]
Kumar, A., Sabharwal, Y., and Sen, S. 2005. Linear time algorithms for clustering problems in any dimensions. In Proceedings of the 32nd International Colloquium on Automata, Languages and Programming (ICALP'05). Springer, 1374--1385.
[28]
Lloyd, S. P. 1982. Least squares quantization in PCM. IEEE Trans. Inform. Theory 28, 2, 129--137.
[29]
Mahalanobis, P. C. 1936. On the generalized distance in statistics. Proc. National Inst. Sci. India 2, 1, 49--55.
[30]
Matoušek, J. 2000. On approximate geometric k-clustering. Discr. Comput. Geom. 24, 1, 61--84.
[31]
Mercer, D. P. 2003. Clustering large datasets. Tech. rep., Linacre College.
[32]
Mettu, R. R. and Plaxton, C. G. 2004. Optimal time bounds for approximate clustering. Mach. Learn. 56, 1--3, 35--60.
[33]
Nock, R., Luosto, P., and Kivinen, J. 2008. Mixed Bregman clustering with approximation guarantees. In Proceedings of the European Conference on Machine Learning (ECML'08). Springer, 154--169.
[34]
Ostrovsky, R., Rabani, Y., Schulman, L. J., and Swamy, C. 2006. The effectiveness of Lloyd-type methods for the k-means problem. In Proceedings of the 47th Annual Symposium on Foundations of Computer Science (FOCS'06). IEEE Computer Society, 165--176.
[35]
Pereira, F. C. N., Tishby, N., and Lee, L. 1993. Distributional clustering of English words. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (ACL'93). ACL, 183--190.
[36]
Slonim, N. and Tishby, N. 1999. Agglomerative information bottleneck. In Advances in Neural Information Processing Systems 12 (NIPS 12). The MIT Press, 617--623.
[37]
Sra, S., Jegelka, S., and Banerjee, A. 2008. Approximation algorithms for Bregman clustering, co-clustering and tensor clustering. Tech. rep. MPIK-TR-177.
[38]
Thorup, M. 2005. Quick k-median, k-center, and facility location for sparse graphs. SIAM J. Comput. 34, 2, 405--432.
[39]
Xu, R. and Wunsch, II, D. C. 2005. Survey of clustering algorithms. IEEE Trans. Neural Netw. 16, 3, 645--678.

Reviews

Aris Gkoulalas-Divanis

The k-median problem can be stated simply: given a set of locations and a maximum number of facilities, decide at which locations facilities should be placed in order to minimize the total cost, computed as the average distance of each location to its nearest facility. The k-median problem has been proved NP-hard in arbitrary metric spaces; for this reason, several polynomial-time approximation algorithms have been developed that offer different approximation guarantees.

The authors study a generalization of the k-median problem where, given a finite set of objects P derived from a ground set D, the goal is to find a (smaller) set C of k objects from D such that the sum of errors Dist(P, C) between P and C is minimized. The elements of C are the k-medians of P. The authors make no assumptions about the dissimilarity measure other than Dist(x, y) = 0 if and only if x = y.

As the authors prove, the k-median problem admits a (1+ε)-approximation algorithm when the 1-median problem can be approximated within a factor of (1+ε) by taking a random sample of constant size and solving the 1-median problem on the sample exactly. Using this characterization, [the authors] obtain the first linear time (1+ε)-approximation algorithms for the k-median problem in an arbitrary metric space with bounded doubling dimension, for the Kullback-Leibler divergence, [for the Itakura-Saito divergence, for Mahalanobis distances, and for special cases of Bregman divergences].

This highly technical paper is interesting and makes an important contribution. I recommend it to those who conduct research in this area.

Online Computing Reviews Service


Information & Contributors

Information

Published In

ACM Transactions on Algorithms, Volume 6, Issue 4
August 2010
308 pages
ISSN:1549-6325
EISSN:1549-6333
DOI:10.1145/1824777
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 September 2010
Accepted: 01 August 2009
Received: 01 July 2009
Published in TALG Volume 6, Issue 4


Author Tags

  1. k-means clustering
  2. k-median clustering
  3. Approximation algorithm
  4. Bregman divergences
  5. Itakura-Saito divergence
  6. Kullback-Leibler divergence
  7. Mahalanobis distance
  8. random sampling

Qualifiers

  • Research-article
  • Research
  • Refereed


Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 71
  • Downloads (Last 6 weeks): 5
Reflects downloads up to 17 Jan 2025

Citations

Cited By
  • (2024) Optimal 2D audio features estimation for a lightweight application in mosquitoes species. Computers in Biology and Medicine 168:C. DOI: 10.1016/j.compbiomed.2023.107787. Online publication date: 12-Apr-2024
  • (2024) Speeding Up Constrained k-Means Through 2-Means. Algorithmic Aspects in Information and Management (52-63). DOI: 10.1007/978-981-97-7801-0_5. Online publication date: 19-Sep-2024
  • (2023) Approximating (k,ℓ)-Median Clustering for Polygonal Curves. ACM Transactions on Algorithms 19:1 (1-32). DOI: 10.1145/3559764. Online publication date: 23-Feb-2023
  • (2023) netANOVA: novel graph clustering technique with significance assessment via hierarchical ANOVA. Briefings in Bioinformatics 24:2. DOI: 10.1093/bib/bbad029. Online publication date: 4-Feb-2023
  • (2023) Linear-time approximation scheme for k-means clustering of axis-parallel affine subspaces. Computational Geometry 112 (101981). DOI: 10.1016/j.comgeo.2023.101981. Online publication date: Jun-2023
  • (2022) Adaptive k-center and diameter estimation in sliding windows. International Journal of Data Science and Analytics 14:2 (155-173). DOI: 10.1007/s41060-022-00318-z. Online publication date: 2-Apr-2022
  • (2022) A family of pairwise multi-marginal optimal transports that define a generalized metric. Machine Learning 112:1 (353-384). DOI: 10.1007/s10994-022-06280-y. Online publication date: 20-Dec-2022
  • (2022) Improved local search algorithms for Bregman k-means and its variants. Journal of Combinatorial Optimization 44:4 (2533-2550). DOI: 10.1007/s10878-021-00771-9. Online publication date: 1-Nov-2022
  • (2021) Approximating (k, ℓ)-median clustering for polygonal curves. Proceedings of the Thirty-Second Annual ACM-SIAM Symposium on Discrete Algorithms (2697-2716). DOI: 10.5555/3458064.3458224. Online publication date: 10-Jan-2021
  • (2021) Picture Hesitant Fuzzy Clustering Based on Generalized Picture Hesitant Fuzzy Distance Measures. Knowledge 1:1 (40-51). DOI: 10.3390/knowledge1010005. Online publication date: 14-Oct-2021