Clustering with or without the Approximation

Schalekamp, Frans; Yu, Michael; van Zuylen, Anke

doi:10.1007/978-3-642-14031-0_10


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6196)

Included in the following conference series:

International Computing and Combinatorics Conference


Abstract

We study algorithms for clustering data that were recently proposed by Balcan, Blum and Gupta in SODA ’09 [4] and that have already given rise to two follow-up papers. The input to the clustering problem consists of points in a metric space and a number k specifying the desired number of clusters. The algorithms find a clustering that is provably close to a target clustering, provided that the instance has the “(1 + α, ε)-property”: every solution to the k-median problem whose objective value is at most (1 + α) times the optimal objective value corresponds to a clustering that misclassifies at most an ε fraction of the points with respect to the target clustering. We investigate the theoretical and practical implications of their results.
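To make the definition concrete, the following is a minimal Python sketch, not the authors' code, of the two quantities the (1 + α, ε)-property relates: the k-median objective of a set of centers and the fraction of points misclassified with respect to a target clustering. All function names are hypothetical. The sketch assumes that target labels are integers in 0..k-1 and that misclassification is measured under the best matching of cluster labels; the abstract does not spell out either detail.

```python
from itertools import permutations

def kmedian_cost(points, centers, dist):
    """k-median objective: sum of each point's distance to its nearest center."""
    return sum(min(dist(p, c) for c in centers) for p in points)

def assign(points, centers, dist):
    """Label each point with the index of its nearest center (ties go to the lower index)."""
    return [min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            for p in points]

def misclassified_fraction(labels, target, k):
    """Fraction of points that disagree with the target under the best relabeling.

    Assumption: labels and target both use integers 0..k-1. Brute force over
    all k! relabelings, so this is only practical for small k.
    """
    n = len(labels)
    best = n
    for perm in permutations(range(k)):
        errors = sum(1 for a, t in zip(labels, target) if perm[a] != t)
        best = min(best, errors)
    return best / n

def violates_property(points, centers, target, opt_cost, alpha, eps, dist):
    """True if these centers witness that the (1 + alpha, eps)-property fails:
    the solution costs at most (1 + alpha) * opt_cost, yet it misclassifies
    more than an eps fraction of the points."""
    cheap = kmedian_cost(points, centers, dist) <= (1 + alpha) * opt_cost
    labels = assign(points, centers, dist)
    far = misclassified_fraction(labels, target, len(centers)) > eps
    return cheap and far
```

A single call to violates_property examines only one candidate solution. The property itself quantifies over all solutions within the cost bound, so a check like this can refute the property for a given (α, ε) but cannot confirm it.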

Our main contributions are as follows. First, we show that instances that have the (1 + α, ε)-property and whose target clusters are, in addition, large are easier than general instances: the algorithm proposed in [4] is a constant-factor approximation algorithm with an approximation guarantee better than the known hardness of approximation for general instances. Second, we show that it is NP-hard to check whether an instance satisfies the (1 + α, ε)-property for a given (α, ε), even though the algorithms in [4] need such α and ε as input parameters. We therefore propose ways to use their algorithms even when we do not know values of α and ε for which the assumption holds. Finally, we implement these methods and other popular methods, and test them on real-world data sets. We find that on these data sets there are no α and ε for which the data set has both the (1 + α, ε)-property and sufficiently large clusters in the target solution. For the general case, we show that on our data sets the performance guarantee proved in [4] is meaningless for the values of α and ε for which the data set has the (1 + α, ε)-property. The algorithm nonetheless gives reasonable results, although it is outperformed by other methods.

This work was supported in part by the National Natural Science Foundation of China Grant 60553001 and the National Basic Research Program of China Grants 2007CB807900 and 2007CB807901. Part of this work was done while the second author was visiting the Institute for Theoretical Computer Science at Tsinghua University in the summer of 2009.


References

1. Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: SODA ’07: 18th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035 (2007)
2. Arya, V., Garg, N., Khandekar, R., Meyerson, A., Munagala, K., Pandit, V.: Local search heuristics for k-median and facility location problems. SIAM J. Comput. 33(3), 544–562 (2004)
3. Asuncion, A., Newman, D.: UCI machine learning repository (2007), http://www.ics.uci.edu/~mlearn/MLRepository.html
4. Balcan, M.-F., Blum, A., Gupta, A.: Approximate clustering without the approximation. In: SODA ’09: 20th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1068–1077 (2009)
5. Balcan, M.-F., Blum, A., Vempala, S.: A discriminative framework for clustering via similarity functions. In: STOC 2008: 40th Annual ACM Symposium on Theory of Computing, pp. 671–680 (2008)
6. Balcan, M.-F., Braverman, M.: Finding low error clusterings. In: COLT 2009: 22nd Annual Conference on Learning Theory (2009)
7. Balcan, M.-F., Röglin, H., Teng, S.-H.: Agnostic clustering. In: Gavaldà, R., Lugosi, G., Zeugmann, T., Zilles, S. (eds.) ALT 2009. LNCS, vol. 5809, pp. 384–398. Springer, Heidelberg (2009)
8. Beasley, J.E.: A note on solving large p-median problems. European Journal of Operational Research 21(2), 270–273 (1985)
9. Beasley, J.E.: OR-Library p-median - uncapacitated (1985), http://people.brunel.ac.uk/~mastjjb/jeb/orlib/pmedinfo.html
10. Bilu, Y., Linial, N.: Are stable instances easy? In: ICS 2010: The First Symposium on Innovations in Computer Science, pp. 332–341 (2010)
11. Gupta, A.: Personal Communication (2009)
12. Jain, K., Mahdian, M., Markakis, E., Saberi, A., Vazirani, V.V.: Greedy facility location algorithms analyzed using dual fitting with factor-revealing LP. J. ACM 50(6), 795–824 (2003)
13. Jain, K., Vazirani, V.V.: Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation. J. ACM 48(2), 274–296 (2001)
14. Lloyd, S.: Least squares quantization in PCM. IEEE Transactions on Information Theory 28(2), 129–137 (1982)
15. Ostrovsky, R., Rabani, Y., Schulman, L.J., Swamy, C.: The effectiveness of Lloyd-type methods for the k-means problem. In: FOCS ’06: 47th Annual IEEE Symposium on Foundations of Computer Science, pp. 165–176 (2006)
16. Schalekamp, F., Yu, M., van Zuylen, A.: Clustering with or without the approximation, http://www.itcs.tsinghua.edu.cn/~frans/pub/ClustCOCOON.pdf


Author information

Authors and Affiliations

ITCS, Tsinghua University: Frans Schalekamp & Anke van Zuylen
MIT: Michael Yu


Editor information

Editors and Affiliations

University of Florida, CSE Building, Room 566, P.O. Box 116120, 32611-6120, Gainesville, Florida, USA
My T. Thai
Department of Computer and Information Science and Technology, University of Florida, P.O. Box, USA
Sartaj Sahni


Cite this paper

Schalekamp, F., Yu, M., van Zuylen, A. (2010). Clustering with or without the Approximation. In: Thai, M.T., Sahni, S. (eds) Computing and Combinatorics. COCOON 2010. Lecture Notes in Computer Science, vol 6196. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14031-0_10


DOI: https://doi.org/10.1007/978-3-642-14031-0_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14030-3
Online ISBN: 978-3-642-14031-0
eBook Packages: Computer Science, Computer Science (R0)
