Clustering with or without the approximation

Schalekamp, Frans; Yu, Michael; van Zuylen, Anke

doi:10.1007/s10878-011-9382-6

Published: 04 February 2011

Volume 25, pages 393–429, (2013)

Journal of Combinatorial Optimization

Abstract

We study algorithms for clustering data that were recently proposed by Balcan et al. (SODA’09: 19th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1068–1077, 2009a) and that have already given rise to several follow-up papers. The input for the clustering problem consists of points in a metric space and a number k, specifying the desired number of clusters. The algorithms find a clustering that is provably close to a target clustering, provided that the instance has the “(1+α,ε)-property”, which means that the instance is such that all solutions to the k-median problem for which the objective value is at most (1+α) times the optimal objective value correspond to clusterings that misclassify at most an ε fraction of the points with respect to the target clustering. We investigate the theoretical and practical implications of their results.
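The definition above can be made concrete on a tiny instance. The following is a minimal sketch, not the authors' code: it assumes the discrete k-median problem (centers are chosen among the input points, each point is assigned to its nearest center) and measures misclassification as the fraction of points that disagree with the target, minimized over relabelings of the clusters. The function names and the example data are hypothetical. The check enumerates every set of k centers, so it takes exponential time and is meant only as an illustration.

```python
from itertools import combinations, permutations

def assign(dist, centers):
    """Assign each point to its nearest center; return (labels, k-median cost)."""
    labels, cost = [], 0.0
    for row in dist:
        j = min(range(len(centers)), key=lambda c: row[centers[c]])
        labels.append(j)
        cost += row[centers[j]]
    return labels, cost

def misclassified_fraction(labels, target, k):
    """Fraction of points where labels and target disagree,
    minimized over all relabelings of the k clusters (assumed distance)."""
    best = len(labels)
    for perm in permutations(range(k)):
        errors = sum(1 for a, b in zip(labels, target) if perm[a] != b)
        best = min(best, errors)
    return best / len(labels)

def has_property(dist, k, target, alpha, eps):
    """Brute-force check of the (1+alpha, eps)-property on a small instance."""
    solutions = [assign(dist, c) for c in combinations(range(len(dist)), k)]
    opt = min(cost for _, cost in solutions)
    return all(
        misclassified_fraction(labels, target, k) <= eps
        for labels, cost in solutions
        if cost <= (1 + alpha) * opt
    )

# Hypothetical example: four points on a line, two well-separated pairs.
xs = [0, 1, 10, 11]
dist = [[abs(a - b) for b in xs] for a in xs]
target = [0, 0, 1, 1]
```

On this example the optimal cost is 2, and every solution within a factor 1.5 of it recovers the target exactly. Raising α to 9 admits solutions of cost 19 that misplace one of the four points, so the property then holds only for ε ≥ 0.25.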

Our main contributions are as follows. First, we show that instances that have the (1+α,ε)-property and for which, additionally, the clusters in the target clustering are large, are easier than general instances: the algorithm proposed in Balcan et al. (SODA’09: 19th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1068–1077, 2009a) is a constant factor approximation algorithm with an approximation guarantee that is better than the known hardness of approximation for general instances. Further, we show that it is NP-hard to check if an instance satisfies the (1+α,ε)-property for a given (α,ε); however, the algorithms in Balcan et al. (SODA’09: 19th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1068–1077, 2009a) need such α and ε as input parameters. We propose ways to use their algorithms even if we do not know values of α and ε for which the assumption holds. Finally, we implement these methods and other popular methods, and test them on real-world data sets. We find that on these data sets there are no α and ε such that the data set has both the (1+α,ε)-property and sufficiently large clusters in the target solution. For the general case where there are no assumptions about the cluster sizes, we show that on our data sets the performance guarantee proved by Balcan et al. (SODA’09: 19th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1068–1077, 2009a) is meaningless for the values of α,ε for which the data set has the (1+α,ε)-property. The algorithm nonetheless gives reasonable results, although it is outperformed by other methods.

References

Arora S, Raghavan P, Rao S (1999) Approximation schemes for Euclidean k-medians and related problems. In: STOC ’98: proceedings of the 30th annual ACM symposium on theory of computing, pp 106–113
Arthur D, Vassilvitskii S (2006) How slow is the k-means method? In: SCG ’06: 22nd annual symposium on computational geometry, pp 144–153
Arthur D, Vassilvitskii S (2007) k-means++: the advantages of careful seeding. In: SODA ’07: 18th annual ACM-SIAM symposium on discrete algorithms, pp 1027–1035
Arya V, Garg N, Khandekar R, Meyerson A, Munagala K, Pandit V (2004) Local search heuristics for k-median and facility location problems. SIAM J Comput 33(3):544–562
Asuncion A, Newman D (2007) UCI machine learning repository. http://www.ics.uci.edu/mlearn/MLRepository.html
Awasthi P, Blum A, Sheffet O (2010) Clustering under natural stability assumptions. http://repository.cmu.edu/compsci/123/, retrieved on June 9th, 2010
Balcan MF, Braverman M (2009) Finding low error clusterings. In: COLT 2009: 22nd annual conference on learning theory
Balcan MF, Blum A, Vempala S (2008) A discriminative framework for clustering via similarity functions. In: STOC 2008: 40th annual ACM symposium on theory of computing, pp 671–680
Balcan MF, Blum A, Gupta A (2009a) Approximate clustering without the approximation. In: SODA ’09: 19th annual ACM-SIAM symposium on discrete algorithms, pp 1068–1077
Balcan MF, Röglin H, Teng SH (2009b) Agnostic clustering. In: ALT 2009: 20th international conference on algorithmic learning theory. Lecture notes in computer science, vol 5809. Springer, Berlin, pp 384–398
Balcan MF, Röglin H, Teng S, Voevodski K, Xia Y (2010) Efficient clustering with limited distance information. In: UAI 2010: the 26th conference on uncertainty in artificial intelligence
Beasley JE (1985a) A note on solving large p-median problems. Eur J Oper Res 21(2):270–273
Beasley JE (1985b) OR-Library p-median—uncapacitated. http://people.brunel.ac.uk/mastjjb/jeb/orlib/pmedinfo.html
Bilu Y, Linial N (2010) Are stable instances easy? In: ICS 2010: the first symposium on innovations in computer science, pp 332–341
Charikar M, Guha S (2005) Improved combinatorial algorithms for facility location problems. SIAM J Comput 34(4):803–824 (electronic)
Charikar M, Guha S, Tardos É, Shmoys DB (2002) A constant-factor approximation algorithm for the k-median problem. J Comput Syst Sci 65(1):129–149
Feige U (1998) A threshold of ln n for approximating set cover. J ACM 45(4):634–652
Gupta A (2009) personal communication
Jain K, Vazirani VV (2001) Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation. J ACM 48(2):274–296
Jain K, Mahdian M, Markakis E, Saberi A, Vazirani VV (2003) Greedy facility location algorithms analyzed using dual fitting with factor-revealing LP. J ACM 50(6):795–824
Lloyd S (1982) Least squares quantization in PCM. IEEE Trans Inf Theory 28(2):129–137
Ostrovsky R, Rabani Y, Schulman LJ, Swamy C (2006) The effectiveness of Lloyd-type methods for the k-means problem. In: FOCS ’06: 47th annual IEEE symposium on foundations of computer science, pp 165–176

Author information

Anke van Zuylen
Present address: Max Planck Institute for Informatics, Saarbrücken, Germany

Authors and Affiliations

ITCS, Tsinghua University, Beijing, China
Frans Schalekamp & Anke van Zuylen
MIT, Cambridge, USA
Michael Yu

Corresponding author

Correspondence to Anke van Zuylen.

About this article

Cite this article

Schalekamp, F., Yu, M. & van Zuylen, A. Clustering with or without the approximation. J Comb Optim 25, 393–429 (2013). https://doi.org/10.1007/s10878-011-9382-6

Issue Date: April 2013
