Abstract
We introduce a highly efficient k-means clustering approach. We show that the classical central limit theorem addresses a special case (k = 1) of the k-means problem, and we then extend it to the general case. Instead of clustering the full dataset, our algorithm, named k-means-lite, applies standard k-means to the combination C (of size nk) of all sample centroids obtained from n independent small samples. Unlike ordinary uniform sampling, this approach asymptotically preserves the performance of the base algorithm. In our experiments with a wide range of synthetic and real-world datasets, k-means-lite matches the performance of k-means when C is constructed from 30 samples of size 40 + 2k. Although the 30-sample choice proves to be a generally reliable rule, when the proposed approach is used to scale k-means++ (we call this scaled version k-means-lite++), the performance of k-means++ is matched in several cases using only five samples. These two new algorithms are presented to demonstrate the proposed approach, but because the approach does not modify the internal workings of the base algorithm, it can be applied to create a constant-time version of any other k-means clustering algorithm.
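The procedure described in the abstract can be sketched as follows: draw n small samples, cluster each with plain k-means, pool the resulting sample centroids into C, and cluster C itself. This is a minimal illustrative sketch, not the authors' implementation; the helper `lloyd_kmeans` is a hypothetical plain Lloyd's-algorithm routine introduced here for self-containment, and the defaults (30 samples of size 40 + 2k) follow the rule stated in the abstract.

```python
import numpy as np

def lloyd_kmeans(X, k, iters=50, seed=0):
    # Plain Lloyd's algorithm with random initialization (illustrative helper).
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers

def kmeans_lite(X, k, n_samples=30, seed=0):
    # k-means-lite sketch: cluster n small samples of size 40 + 2k,
    # pool the n*k sample centroids into C, then cluster C itself.
    rng = np.random.default_rng(seed)
    size = 40 + 2 * k
    C = np.vstack([
        lloyd_kmeans(X[rng.choice(len(X), size, replace=False)], k, seed=s)
        for s in range(n_samples)
    ])
    return lloyd_kmeans(C, k, seed=seed)
```

Because only small fixed-size samples and the pooled set C (of size nk) are ever clustered, the cost of the lite step does not grow with the dataset size, which is the source of the constant-time behavior claimed in the title.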
Acknowledgements
The first author was supported by the Global Excellence and Stature Scholarship Fund of the University of Johannesburg, South Africa.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Olukanmi, P., Nelwamondo, F. & Marwala, T. Rethinking k-means clustering in the age of massive datasets: a constant-time approach. Neural Comput & Applic 32, 15445–15467 (2020). https://doi.org/10.1007/s00521-019-04673-0