Rethinking k-means clustering in the age of massive datasets: a constant-time approach

Special Issue: 2018 India International Congress on Computational Intelligence

Published in Neural Computing and Applications

Abstract

We introduce a highly efficient k-means clustering approach. We show that the classical central limit theorem addresses a special case (k = 1) of the k-means problem and then extend it to the general case. Instead of using the full dataset, our algorithm, named k-means-lite, applies standard k-means to the combination C (of size nk) of all sample centroids obtained from n independent small samples. Unlike ordinary uniform sampling, this approach asymptotically preserves the performance of the original algorithm. In our experiments with a wide range of synthetic and real-world datasets, k-means-lite matches the performance of k-means when C is constructed from 30 samples of size 40 + 2k. Although the 30-sample choice proves to be a generally reliable rule, when the proposed approach is used to scale k-means++ (we call this scaled version k-means-lite++), the performance of k-means++ is matched in several cases using only five samples. These two new algorithms are presented to demonstrate the proposed approach, but the approach can be applied to create a constant-time version of any other k-means clustering algorithm, since it does not modify the internal workings of the base algorithm.
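To make the sampling scheme concrete, the sketch below shows one way the procedure described in the abstract could be implemented. It is a minimal illustration under stated assumptions, not the authors' published code: scikit-learn's KMeans stands in for the base algorithm, and the function name k_means_lite and its parameters are our own labels.

    import numpy as np
    from sklearn.cluster import KMeans

    def k_means_lite(X, k, n_samples=30, sample_size=None, seed=None):
        """Illustrative sketch: cluster pooled sample centroids instead of X."""
        rng = np.random.default_rng(seed)
        if sample_size is None:
            sample_size = 40 + 2 * k  # the paper's empirical rule of thumb

        # Run the base k-means on n_samples small uniform samples and
        # pool the resulting n_samples * k centroids into C.
        centroids = []
        for _ in range(n_samples):
            idx = rng.choice(len(X), size=sample_size, replace=False)
            km = KMeans(n_clusters=k, n_init=10).fit(X[idx])
            centroids.append(km.cluster_centers_)
        C = np.vstack(centroids)  # the combination C, of size n_samples * k

        # Cluster C itself; its k centroids are the final estimates.
        return KMeans(n_clusters=k, n_init=10).fit(C).cluster_centers_

Note that scikit-learn's KMeans seeds with k-means++ by default, so the sketch as written is closer to the k-means-lite++ variant; passing init='random' approximates plain k-means-lite. Labels for the full dataset can then be assigned by a nearest-centroid pass, and since each k-means call touches only sample_size points, the cost of the clustering itself is independent of the dataset size.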

Acknowledgements

The first author was supported by the Global Excellence and Stature Scholarship Fund of the University of Johannesburg, South Africa.

Author information


Corresponding author

Correspondence to P. Olukanmi.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Olukanmi, P., Nelwamondo, F. & Marwala, T. Rethinking k-means clustering in the age of massive datasets: a constant-time approach. Neural Comput & Applic 32, 15445–15467 (2020). https://doi.org/10.1007/s00521-019-04673-0
