k-means++ under Approximation Stability

  • Conference paper
Theory and Applications of Models of Computation (TAMC 2013)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 7876)

Abstract

Lloyd's algorithm, also known as the k-means algorithm, is one of the most popular algorithms in practice for the k-means clustering problem. However, it gives no performance guarantee: there are datasets on which it can behave very badly. One reason for poor performance on certain datasets is bad initialization. The following simple sampling-based seeding procedure tends to fix this problem: pick the first center uniformly at random from the given points, and then, for i ≥ 2, pick a point to be the i-th center with probability proportional to its squared distance to the nearest previously chosen center. This procedure is popularly known as the k-means++ seeding algorithm, and it is known to exhibit some nice properties, which have been studied in a number of previous works [AV07, AJM09, ADK09, BR11]. The algorithm tends to perform well when the optimal clusters are separated in some sense, because it gives preference to points far from the centers already chosen. Ostrovsky et al. [ORSS06] discuss one such separation condition on the data. Jaiswal and Garg [JG12] show that if the dataset satisfies the separation condition of [ORSS06], then the seeding algorithm alone gives a constant-factor approximation with probability Ω(1/k). A separation condition that is strictly weaker than that of [ORSS06] is the approximation stability condition of Balcan et al. [BBG09]. In this work, we show that the seeding algorithm gives a constant-factor approximation with probability Ω(1/k) if the dataset satisfies the separation condition of [BBG09] and the optimal clusters are not too small. We also give a negative result for datasets that have small optimal clusters.
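To make the seeding procedure concrete, here is a minimal Python sketch of the D²-sampling step described in the abstract. It is an illustration under stated assumptions, not the authors' implementation: the function names and the toy dataset are invented for the example, and the sketch omits edge-case care (for instance, it will fail if k exceeds the number of distinct points, since all sampling weights become zero).

```python
import random

def sq_dist(p, q):
    # Squared Euclidean distance between two points in R^d.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeanspp_seeding(points, k, rng=random):
    """k-means++ seeding (D^2-sampling): the first center is chosen
    uniformly at random; each subsequent center is a point sampled with
    probability proportional to its squared distance to the nearest
    center chosen so far."""
    centers = [rng.choice(points)]
    # d2[i] = squared distance from points[i] to its nearest chosen center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    for _ in range(1, k):
        nxt = rng.choices(points, weights=d2, k=1)[0]  # the D^2-sampling step
        centers.append(nxt)
        # Incrementally update nearest-center distances for the new center.
        d2 = [min(old, sq_dist(p, nxt)) for old, p in zip(d2, points)]
    return centers

# Toy usage: three well-separated pairs of points; with high probability
# the three chosen centers land in three different pairs.
pts = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (21, 0)]
print(kmeanspp_seeding(pts, k=3))
```

Maintaining d2 incrementally keeps the whole seeding at O(nkd) time for n points in d dimensions. Since the guarantee discussed in this paper holds only with probability Ω(1/k), in practice one would repeat the seeding independently and keep the run with the smallest k-means cost.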


References

  1. Aggarwal, A., Deshpande, A., Kannan, R.: Adaptive sampling for k-means clustering. In: Dinur, I., Jansen, K., Naor, J., Rolim, J. (eds.) APPROX and RANDOM 2009. LNCS, vol. 5687, pp. 15–28. Springer, Heidelberg (2009)

  2. Ailon, N., Jaiswal, R., Monteleoni, C.: Streaming k-means approximation. In: Advances in Neural Information Processing Systems (NIPS 2009), pp. 10–18 (2009)

  3. Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2007), pp. 1027–1035 (2007)

  4. Awasthi, P., Blum, A., Sheffet, O.: Stability yields a PTAS for k-median and k-means clustering. In: Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS 2010), pp. 309–318 (2010)

  5. Balcan, M.-F., Blum, A., Gupta, A.: Approximate clustering without the approximation. In: Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2009), pp. 1068–1077 (2009)

  6. Brunsch, T., Röglin, H.: A bad instance for k-means++. In: Ogihara, M., Tarui, J. (eds.) TAMC 2011. LNCS, vol. 6648, pp. 344–352. Springer, Heidelberg (2011)

  7. Jaiswal, R., Garg, N.: Analysis of k-means++ for separable data. In: Proceedings of the 16th International Workshop on Randomization and Computation (RANDOM 2012), pp. 591–602 (2012)

  8. Jaiswal, R., Kumar, A., Sen, S.: A simple D²-sampling based PTAS for k-means and other clustering problems. In: Gudmundsson, J., Mestre, J., Viglas, T. (eds.) COCOON 2012. LNCS, vol. 7434, pp. 13–24. Springer, Heidelberg (2012)

  9. Ostrovsky, R., Rabani, Y., Schulman, L.J., Swamy, C.: The effectiveness of Lloyd-type methods for the k-means problem. In: Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2006), pp. 165–176 (2006)


Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Agarwal, M., Jaiswal, R., Pal, A. (2013). k-means++ under Approximation Stability. In: Chan, T.-H.H., Lau, L.C., Trevisan, L. (eds) Theory and Applications of Models of Computation. TAMC 2013. Lecture Notes in Computer Science, vol 7876. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38236-9_9

  • DOI: https://doi.org/10.1007/978-3-642-38236-9_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-38235-2

  • Online ISBN: 978-3-642-38236-9

  • eBook Packages: Computer Science
