Abstract

The k-means++ [5] seeding procedure is a simple sampling-based algorithm used to quickly find k centers, which may then be used to initialize Lloyd's method. There has been some progress recently on understanding this sampling algorithm. Ostrovsky et al. [10] showed that if the data satisfies the separation condition \(\frac{\Delta_{k-1}(P)}{\Delta_k(P)} \geq c\) (where \(\Delta_i(P)\) is the optimal cost w.r.t. i centers, \(c > 1\) is a constant, and P is the point set), then the sampling algorithm gives an O(1)-approximation for the k-means problem with probability that is exponentially small in k. Here, the distance measure is the squared Euclidean distance. Ackermann and Blömer [2] showed the same result when the distance measure is any μ-similar Bregman divergence. Arthur and Vassilvitskii [5] showed that the k-means++ seeding gives an O(log k) approximation in expectation for the k-means problem. They also gave an instance where k-means++ seeding gives an Ω(log k) approximation in expectation. However, it was unresolved whether the seeding procedure gives an O(1) approximation with probability \(\Omega\left(\frac{1}{\mathrm{poly}(k)}\right)\), even when the data satisfies the above-mentioned separation condition. Brunsch and Röglin [8] addressed this question and gave an instance on which k-means++ achieves an approximation ratio of (2/3 − ε) · log k only with exponentially small probability. However, the instances that they give satisfy \(\frac{\Delta_{k-1}(P)}{\Delta_k(P)} = 1+o(1)\). In this work, we show that the sampling algorithm gives an O(1) approximation with probability \(\Omega\left(\frac{1}{k}\right)\) for any k-means problem instance where the point set satisfies the separation condition \(\frac{\Delta_{k-1}(P)}{\Delta_k(P)} \geq 1 + \gamma\), for some fixed constant γ. Our results hold for any distance measure that is a metric in an approximate sense. For point sets that do not satisfy the above separation condition, we show an O(1) approximation with probability \(\Omega(2^{-2k})\).
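The seeding procedure analyzed here is often called D²-sampling: the first center is chosen uniformly at random from P, and each subsequent center is a point of P chosen with probability proportional to its squared distance from the nearest center picked so far. The following Python sketch illustrates the procedure on a toy instance; the blob locations, spread, and sample sizes are illustrative assumptions, not the instances studied in the paper.

```python
import numpy as np

def kmeans_pp_seed(points, k, rng=None):
    """k-means++ seeding (D^2-sampling) as in Arthur and Vassilvitskii [5].

    Picks the first center uniformly at random from `points`; every
    subsequent center is sampled with probability proportional to the
    squared Euclidean distance to the nearest center chosen so far.
    """
    rng = np.random.default_rng(rng)
    n = points.shape[0]
    centers = [points[rng.integers(n)]]
    # d2[i] = squared distance from points[i] to its closest chosen center
    d2 = np.sum((points - centers[0]) ** 2, axis=1)
    for _ in range(k - 1):
        idx = rng.choice(n, p=d2 / d2.sum())  # the D^2-sampling step
        centers.append(points[idx])
        d2 = np.minimum(d2, np.sum((points - points[idx]) ** 2, axis=1))
    return np.array(centers)

# Toy well-separated instance (hypothetical parameters): k Gaussian blobs
# whose centers are far apart relative to their spread, so that the ratio
# Delta_{k-1}(P) / Delta_k(P) is large.
rng = np.random.default_rng(0)
k = 5
blob_centers = rng.uniform(-50.0, 50.0, size=(k, 2))
points = np.vstack([c + rng.normal(0.0, 1.0, size=(200, 2)) for c in blob_centers])

centers = kmeans_pp_seed(points, k, rng=rng)
# k-means cost of the seeding alone (no Lloyd steps):
cost = np.min(((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1), axis=1).sum()
print(f"seeding cost: {cost:.1f}")
```

Intuitively, on such a separated instance each D²-sampling step lands in a yet-uncovered optimal cluster with good probability, which is the kind of event the paper's Ω(1/k) success bound tracks; running the sketch repeatedly and comparing costs gives a rough empirical feel for this.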


References

  1. Ackermann, M.R.: Algorithms for the Bregman k-Median Problem. PhD thesis, University of Paderborn, Department of Computer Science (2009)

  2. Ackermann, M.R., Blömer, J.: Bregman Clustering for Separable Instances. In: Kaplan, H. (ed.) SWAT 2010. LNCS, vol. 6139, pp. 212–223. Springer, Heidelberg (2010)

  3. Aggarwal, A., Deshpande, A., Kannan, R.: Adaptive Sampling for k-Means Clustering. In: Dinur, I., Jansen, K., Naor, J., Rolim, J. (eds.) APPROX and RANDOM 2009. LNCS, vol. 5687, pp. 15–28. Springer, Heidelberg (2009)

  4. Ailon, N., Jaiswal, R., Monteleoni, C.: Streaming k-means approximation. In: Advances in Neural Information Processing Systems, vol. 22, pp. 10–18 (2009)

  5. Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2007), pp. 1027–1035 (2007)

  6. Balcan, M.-F., Blum, A., Gupta, A.: Approximate clustering without the approximation. In: Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2009), pp. 1068–1077 (2009)

  7. Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J.: Clustering with Bregman divergences. Journal of Machine Learning Research 6, 1705–1749 (2005)

  8. Brunsch, T., Röglin, H.: A Bad Instance for k-Means++. In: Ogihara, M., Tarui, J. (eds.) TAMC 2011. LNCS, vol. 6648, pp. 344–352. Springer, Heidelberg (2011)

  9. Mahajan, M., Nimbhorkar, P., Varadarajan, K.: The Planar k-Means Problem is NP-Hard. In: Das, S., Uehara, R. (eds.) WALCOM 2009. LNCS, vol. 5431, pp. 274–285. Springer, Heidelberg (2009)

  10. Ostrovsky, R., Rabani, Y., Schulman, L.J., Swamy, C.: The effectiveness of Lloyd-type methods for the k-means problem. In: Proc. 47th IEEE FOCS, pp. 165–176 (2006)


Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Jaiswal, R., Garg, N. (2012). Analysis of k-Means++ for Separable Data. In: Gupta, A., Jansen, K., Rolim, J., Servedio, R. (eds) Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques. APPROX/RANDOM 2012. Lecture Notes in Computer Science, vol 7408. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32512-0_50

  • DOI: https://doi.org/10.1007/978-3-642-32512-0_50

  • Print ISBN: 978-3-642-32511-3

  • Online ISBN: 978-3-642-32512-0
