DOI: 10.1145/3132847.3133091
short-paper

Fast K-means for Large Scale Clustering

Published: 06 November 2017

Abstract

The k-means algorithm has been widely used in machine learning and data mining because of its simplicity and good performance. However, standard k-means is slow when clustering millions of data points into thousands or even tens of thousands of clusters. In this paper, we propose a fast k-means algorithm named multi-stage k-means (MKM), which uses a multi-stage filtering approach. This approach greatly accelerates k-means via a coarse-to-fine search strategy. To speed up the algorithm further, hashing is introduced to accelerate the assignment step, which is the most time-consuming part of k-means. Extensive experiments on several massive datasets show that the proposed algorithm achieves up to a 600X speed-up over standard k-means with comparable accuracy.
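The abstract does not spell out how the filtering or the hashing works, so the following is only a rough sketch of the general idea of a coarse-to-fine assignment step, not the MKM algorithm itself. It assumes random-hyperplane sign hashing as the coarse filter: each point first shortlists the centers whose binary codes are closest in Hamming distance, then computes exact Euclidean distances only on that shortlist. All function names and parameters (make_hyperplanes, assign_filtered, shortlist, n_bits) are hypothetical and chosen for illustration. Note that sign hashing approximates angular similarity, so the shortlist may miss the true nearest center when it is small.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes (an assumed hash family)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def hash_code(vec, planes):
    """Binary code: one bit per hyperplane, set when the dot product is >= 0."""
    code = 0
    for plane in planes:
        dot = sum(v * p for v, p in zip(vec, plane))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two integer codes."""
    return bin(a ^ b).count("1")


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_exact(points, centers):
    """Standard assignment step: every point is compared with every center."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def assign_filtered(points, centers, planes, shortlist):
    """Coarse-to-fine assignment (illustrative only).

    Coarse stage: rank centers by Hamming distance between binary codes.
    Fine stage: exact distances on the `shortlist` best-ranked centers.
    The coarse stage still visits every center, but with cheap bit operations.
    """
    center_codes = [hash_code(c, planes) for c in centers]
    labels = []
    for p in points:
        code = hash_code(p, planes)
        ranked = sorted(range(len(centers)),
                        key=lambda j: hamming(code, center_codes[j]))
        candidates = ranked[:shortlist]
        labels.append(min(candidates, key=lambda j: sq_dist(p, centers[j])))
    return labels


if __name__ == "__main__":
    rng = random.Random(42)
    points = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(1000)]
    centers = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(50)]
    planes = make_hyperplanes(dim=8, n_bits=16)
    exact = assign_exact(points, centers)
    approx = assign_filtered(points, centers, planes, shortlist=10)
    agree = sum(a == b for a, b in zip(exact, approx)) / len(points)
    print(f"agreement with exact assignment: {agree:.2%}")
```

Setting shortlist equal to the number of centers recovers the exact assignment; smaller values trade accuracy for fewer exact distance computations, which is the trade-off the abstract alludes to.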



Information & Contributors

Information

Published In

CIKM '17: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management
November 2017
2604 pages
ISBN:9781450349185
DOI:10.1145/3132847

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. clustering
  2. hashing
  3. k-means

Qualifiers

  • Short-paper

Conference

CIKM '17

Acceptance Rates

CIKM '17 Paper Acceptance Rate 171 of 855 submissions, 20%;
Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
