Abstract
The k-means++ [5] seeding procedure is a simple sampling-based algorithm used to quickly find k centers, which may then be used to initialize Lloyd's method. There has been some recent progress in understanding this sampling algorithm. Ostrovsky et al. [10] showed that if the data satisfies the separation condition \(\frac{\Delta_{k-1}(P)}{\Delta_k(P)} \geq c\) (where \(\Delta_i(P)\) is the optimal cost with respect to i centers, c > 1 is a constant, and P is the point set), then the sampling algorithm gives an O(1)-approximation for the k-means problem with probability that is exponentially small in k. Here, the distance measure is the squared Euclidean distance. Ackermann and Blömer [2] showed the same result when the distance measure is any μ-similar Bregman divergence. Arthur and Vassilvitskii [5] showed that k-means++ seeding gives an O(log k) approximation in expectation for the k-means problem, and they gave an instance on which it gives an Ω(log k) approximation in expectation. However, it remained unresolved whether the seeding procedure gives an O(1) approximation with probability \(\Omega\left(\frac{1}{\mathrm{poly}(k)}\right)\), even when the data satisfies the above-mentioned separation condition. Brunsch and Röglin [8] addressed this question and gave instances on which k-means++ achieves an approximation ratio better than (2/3 − ε) · log k only with exponentially small probability. However, the instances they give satisfy \(\frac{\Delta_{k-1}(P)}{\Delta_k(P)} = 1+o(1)\). In this work, we show that the sampling algorithm gives an O(1) approximation with probability \(\Omega\left(\frac{1}{k}\right)\) for any k-means problem instance whose point set satisfies the separation condition \(\frac{\Delta_{k-1}(P)}{\Delta_k(P)} \geq 1 + \gamma\) for some fixed constant γ. Our results hold for any distance measure that is a metric in an approximate sense. For point sets that do not satisfy the above separation condition, we show an O(1) approximation with probability \(\Omega(2^{-2k})\).
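For readers unfamiliar with the seeding procedure under discussion: k-means++ seeding is D²-sampling, where the first center is chosen uniformly at random and each subsequent center is chosen with probability proportional to its squared distance from the nearest center picked so far. A minimal NumPy sketch of this standard procedure (the function name and interface are ours, not from the paper):

```python
import numpy as np

def kmeans_pp_seed(points, k, rng=None):
    """k-means++ seeding via D^2-sampling.

    points: (n, d) array; returns (k, d) array of chosen centers.
    """
    rng = np.random.default_rng(rng)
    n = len(points)
    # First center: uniform at random over all points.
    centers = [points[rng.integers(n)]]
    # d2[i] = squared distance from point i to its nearest chosen center.
    d2 = np.sum((points - centers[0]) ** 2, axis=1)
    for _ in range(k - 1):
        # Sample the next center proportional to D^2.
        probs = d2 / d2.sum()
        idx = rng.choice(n, p=probs)
        centers.append(points[idx])
        # Update nearest-center distances with the new center.
        d2 = np.minimum(d2, np.sum((points - points[idx]) ** 2, axis=1))
    return np.array(centers)
```

On well-separated data, once a center lands in one cluster, the remaining D² mass concentrates on the other clusters, which is the intuition behind the separation-based analyses discussed above.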
References
Ackermann, M.R.: Algorithms for the Bregman k-Median Problem. PhD thesis, University of Paderborn, Department of Computer Science (2009)
Ackermann, M.R., Blömer, J.: Bregman Clustering for Separable Instances. In: Kaplan, H. (ed.) SWAT 2010. LNCS, vol. 6139, pp. 212–223. Springer, Heidelberg (2010)
Aggarwal, A., Deshpande, A., Kannan, R.: Adaptive Sampling for k-Means Clustering. In: Dinur, I., Jansen, K., Naor, J., Rolim, J. (eds.) APPROX and RANDOM 2009. LNCS, vol. 5687, pp. 15–28. Springer, Heidelberg (2009)
Ailon, N., Jaiswal, R., Monteleoni, C.: Streaming k-means approximation. In: Advances in Neural Information Processing Systems, vol. 22, pp. 10–18 (2009)
Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2007), pp. 1027–1035 (2007)
Balcan, M.-F., Blum, A., Gupta, A.: Approximate clustering without the approximation. In: Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2009), pp. 1068–1077 (2009)
Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J.: Clustering with Bregman divergences. Journal of Machine Learning Research 6, 1705–1749 (2005)
Brunsch, T., Röglin, H.: A Bad Instance for k-Means++. In: Ogihara, M., Tarui, J. (eds.) TAMC 2011. LNCS, vol. 6648, pp. 344–352. Springer, Heidelberg (2011)
Mahajan, M., Nimbhorkar, P., Varadarajan, K.: The Planar k-Means Problem is NP-Hard. In: Das, S., Uehara, R. (eds.) WALCOM 2009. LNCS, vol. 5431, pp. 274–285. Springer, Heidelberg (2009)
Ostrovsky, R., Rabani, Y., Schulman, L.J., Swamy, C.: The effectiveness of Lloyd-type methods for the k-means problem. In: Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2006), pp. 165–176 (2006)
© 2012 Springer-Verlag Berlin Heidelberg
Cite this paper
Jaiswal, R., Garg, N. (2012). Analysis of k-Means++ for Separable Data. In: Gupta, A., Jansen, K., Rolim, J., Servedio, R. (eds) Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques. APPROX RANDOM 2012 2012. Lecture Notes in Computer Science, vol 7408. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32512-0_50
Print ISBN: 978-3-642-32511-3
Online ISBN: 978-3-642-32512-0