Abstract
The k-means++ [5] seeding procedure is a simple sampling-based algorithm used to quickly find k centers, which may then be used to initialize Lloyd's method. There has been some recent progress in understanding this sampling algorithm. Ostrovsky et al. [10] showed that if the data satisfies the separation condition \(\frac{\Delta_{k-1}(P)}{\Delta_k(P)} \geq c\) (where \(\Delta_i(P)\) is the optimal cost with respect to i centers, c > 1 is a constant, and P is the point set), then the sampling algorithm gives an O(1)-approximation for the k-means problem with probability that is exponentially small in k. Here, the distance measure is the squared Euclidean distance. Ackermann and Blömer [2] showed the same result when the distance measure is any μ-similar Bregman divergence. Arthur and Vassilvitskii [5] showed that k-means++ seeding gives an O(log k) approximation in expectation for the k-means problem, and they gave an instance on which it gives an Ω(log k) approximation in expectation. However, it remained unresolved whether the seeding procedure gives an O(1) approximation with probability \(\Omega\left(\frac{1}{\mathrm{poly}(k)}\right)\), even when the data satisfies the above-mentioned separation condition. Brunsch and Röglin [8] addressed this question and gave instances on which k-means++ achieves an approximation ratio better than (2/3 − ε) · log k only with exponentially small probability. However, the instances they give satisfy \(\frac{\Delta_{k-1}(P)}{\Delta_k(P)} = 1+o(1)\). In this work, we show that the sampling algorithm gives an O(1) approximation with probability \(\Omega\left(\frac{1}{k}\right)\) for any k-means problem instance whose point set satisfies the separation condition \(\frac{\Delta_{k-1}(P)}{\Delta_k(P)} \geq 1 + \gamma\) for some fixed constant γ. Our results hold for any distance measure that is a metric in an approximate sense. For point sets that do not satisfy the above separation condition, we show an O(1) approximation with probability \(\Omega(2^{-2k})\).
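For readers unfamiliar with the seeding procedure under discussion: k-means++ seeding is D²-sampling, where the first center is chosen uniformly at random and each subsequent center is chosen with probability proportional to its squared distance from the nearest center picked so far. A minimal NumPy sketch of this standard procedure (the function name and interface are ours, not from the paper):

```python
import numpy as np

def kmeans_pp_seed(points, k, rng=None):
    """k-means++ seeding via D^2-sampling.

    points: (n, d) array; returns (k, d) array of chosen centers.
    """
    rng = np.random.default_rng(rng)
    n = len(points)
    # First center: uniform at random over all points.
    centers = [points[rng.integers(n)]]
    # d2[i] = squared distance from point i to its nearest chosen center.
    d2 = np.sum((points - centers[0]) ** 2, axis=1)
    for _ in range(k - 1):
        # Sample the next center proportional to D^2.
        probs = d2 / d2.sum()
        idx = rng.choice(n, p=probs)
        centers.append(points[idx])
        # Update nearest-center distances with the new center.
        d2 = np.minimum(d2, np.sum((points - points[idx]) ** 2, axis=1))
    return np.array(centers)
```

On well-separated data, once a center lands in one cluster, the remaining D² mass concentrates on the other clusters, which is the intuition behind the separation-based analyses discussed above.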
References
Ackermann, M.R.: Algorithms for the Bregman k-Median Problem. PhD thesis, University of Paderborn, Department of Computer Science (2009)
Ackermann, M.R., Blömer, J.: Bregman Clustering for Separable Instances. In: Kaplan, H. (ed.) SWAT 2010. LNCS, vol. 6139, pp. 212–223. Springer, Heidelberg (2010)
Aggarwal, A., Deshpande, A., Kannan, R.: Adaptive Sampling for k-Means Clustering. In: Dinur, I., Jansen, K., Naor, J., Rolim, J. (eds.) APPROX and RANDOM 2009. LNCS, vol. 5687, pp. 15–28. Springer, Heidelberg (2009)
Ailon, N., Jaiswal, R., Monteleoni, C.: Streaming k-means approximation. In: Advances in Neural Information Processing Systems, vol. 22, pp. 10–18 (2009)
Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2007), pp. 1027–1035 (2007)
Balcan, M.-F., Blum, A., Gupta, A.: Approximate clustering without the approximation. In: Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2009), pp. 1068–1077 (2009)
Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J.: Clustering with Bregman divergences. Journal of Machine Learning Research 6, 1705–1749 (2005)
Brunsch, T., Röglin, H.: A Bad Instance for k-Means++. In: Ogihara, M., Tarui, J. (eds.) TAMC 2011. LNCS, vol. 6648, pp. 344–352. Springer, Heidelberg (2011)
Mahajan, M., Nimbhorkar, P., Varadarajan, K.: The Planar k-Means Problem is NP-Hard. In: Das, S., Uehara, R. (eds.) WALCOM 2009. LNCS, vol. 5431, pp. 274–285. Springer, Heidelberg (2009)
Ostrovsky, R., Rabani, Y., Schulman, L.J., Swamy, C.: The effectiveness of Lloyd-type methods for the k-means problem. In: Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2006), pp. 165–176 (2006)
© 2012 Springer-Verlag Berlin Heidelberg
Cite this paper
Jaiswal, R., Garg, N. (2012). Analysis of k-Means++ for Separable Data. In: Gupta, A., Jansen, K., Rolim, J., Servedio, R. (eds) Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques. APPROX RANDOM 2012 2012. Lecture Notes in Computer Science, vol 7408. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32512-0_50
Print ISBN: 978-3-642-32511-3
Online ISBN: 978-3-642-32512-0