Abstract
This paper describes two simple modifications of K-means and related clustering algorithms that improve the running time without changing the output. The two resulting algorithms are called Compare-means and Sort-means. The time for an iteration of K-means is reduced from O(ndk), where n is the number of data points, k the number of clusters, and d the dimension, to O(ndγ + k²d + k² log k) for Sort-means. Here γ ≤ k is the average, over all points p, of the number of means that are no more than twice as far from p as the mean p was assigned to in the previous iteration. Compare-means performs a similar number of distance calculations to Sort-means, and is faster when the number of means is very large. Both modifications are extremely simple and could easily be added to existing clustering implementations.
We investigate the empirical performance of the algorithms on three datasets drawn from practical applications. As a primary test case, we use the Isodata variant of K-means on a sample of 2.3 million 6-dimensional points drawn from a Landsat-7 satellite image. For this dataset, γ quickly drops to less than log₂ k, and the running time decreases accordingly. For example, a run with k = 100 drops from an hour and a half to sixteen minutes for Compare-means and six and a half minutes for Sort-means. Further experiments show similar improvements on datasets derived from a forestry application and from the analysis of BGP updates in an IP network.
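The abstract states the speedup but not the pruning rule behind it. The definition of γ, together with the O(k²d + k² log k) preprocessing terms, suggests a triangle-inequality test: for a point p previously assigned to mean c, d(p, m) ≥ d(c, m) − d(p, c) for any other mean m, so any mean m with d(c, m) ≥ 2·d(p, c) satisfies d(p, m) ≥ d(p, c) and cannot become p's nearest mean. The following sketch of one Sort-means-style assignment pass is our reading under that assumption, not the paper's pseudocode; the function name, array layout, and NumPy usage are illustrative.

    import numpy as np

    def sort_means_assign(points, means, prev_assign):
        """One Sort-means-style assignment pass (a sketch, not the paper's code).

        points:      (n, d) array of data points
        means:       (k, d) array of current cluster means
        prev_assign: (n,) int array, index of each point's mean last iteration
        """
        # O(k^2 d): pairwise distances between the current means.
        mean_dist = np.linalg.norm(means[:, None, :] - means[None, :, :], axis=2)
        # O(k^2 log k): for each mean c, all means sorted by distance to c.
        order = np.argsort(mean_dist, axis=1)
        assign = np.empty_like(prev_assign)
        for i, p in enumerate(points):
            c = prev_assign[i]
            d_pc = np.linalg.norm(p - means[c])
            best, best_d = c, d_pc
            for m in order[c]:
                if m == c:
                    continue          # the previous mean itself
                if mean_dist[c, m] >= 2.0 * d_pc:
                    break             # all later means are even farther from c
                # distance computed only for means that survive the test
                d = np.linalg.norm(p - means[m])
                if d < best_d:
                    best, best_d = m, d
            assign[i] = best
        return assign

Dropping the argsort and replacing the break with a continue gives a Compare-means-style pass: the same test skips individual distance computations, but every mean is still examined, consistent with the abstract's note that the two variants perform a similar number of distance calculations. Counting the means that survive the 2·d(p, c) test also yields an empirical estimate of γ.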
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
Cite this paper
Phillips, S.J. (2002). Acceleration of K-Means and Related Clustering Algorithms. In: Mount, D.M., Stein, C. (eds) Algorithm Engineering and Experiments. ALENEX 2002. Lecture Notes in Computer Science, vol 2409. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45643-0_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43977-6
Online ISBN: 978-3-540-45643-8