
A hybrid approach to speed-up the k-means clustering method

  • Original Article
International Journal of Machine Learning and Cybernetics

Abstract

The k-means clustering method is an iterative, partition-based method that, for finite data-sets, converges to a solution in finite time. Its running time grows linearly with the size of the data-set. Many variants have been proposed to speed up the conventional k-means method. In this paper, we propose a prototype-based hybrid approach to speed up k-means clustering. The proposed method first partitions the data-set into small clusters (grouplets) of varying sizes, each represented by a prototype. The set of prototypes is then partitioned into k clusters using a modified k-means method, which is similar to the conventional method but avoids empty clusters (clusters to which no pattern is assigned) during the iterative process. Within each cluster of prototypes, each prototype is replaced by its corresponding set of patterns (those that formed the grouplet) to derive a partition of the data-set. Since this partition can deviate from the one obtained by applying the conventional k-means method to the entire data-set, a correcting step is proposed. Both theoretically and experimentally, the conventional k-means method and the proposed hybrid method (augmented with the correcting step) are shown to yield the same result, provided the initial k seed points are the same; the proposed method, however, is much faster. Experimentally, the proposed method is compared with the conventional method and with other recent methods proposed to speed up k-means.
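The two-stage idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: all function names, the grouplet threshold `tau`, and the leaders-style pass used to form grouplets are our assumptions based on the abstract.

```python
# Sketch of a prototype-based hybrid k-means in the spirit of the abstract.
# Names (tau, build_grouplets, ...) are illustrative, not from the paper.
import numpy as np

def build_grouplets(X, tau):
    """Single pass: each pattern joins the first prototype within
    distance tau; otherwise it becomes a new prototype (grouplet leader)."""
    prototypes, sizes = [], []
    for x in X:
        for i, p in enumerate(prototypes):
            if np.linalg.norm(x - p) <= tau:
                sizes[i] += 1
                break
        else:
            prototypes.append(x)
            sizes.append(1)
    return np.array(prototypes), np.array(sizes, dtype=float)

def weighted_kmeans(P, w, seeds, iters=100):
    """Modified k-means on the prototypes, weighted by grouplet sizes.
    A centroid keeps its old position if its cluster would become empty."""
    C = seeds.copy()
    for _ in range(iters):
        d = ((P[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        a = d.argmin(1)
        newC = C.copy()
        for j in range(len(C)):
            if (a == j).any():                      # avoid empty clusters
                wj = w[a == j]
                newC[j] = (P[a == j] * wj[:, None]).sum(0) / wj.sum()
        if np.allclose(newC, C):
            break
        C = newC
    return C

def hybrid_kmeans(X, k, tau, seeds, iters=100):
    P, w = build_grouplets(X, tau)
    C = weighted_kmeans(P, w, seeds, iters)
    # Correcting step: refine over the full data-set so the final
    # partition agrees with conventional k-means from the same seeds.
    for _ in range(iters):
        a = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
        newC = np.array([X[a == j].mean(0) if (a == j).any() else C[j]
                         for j in range(k)])
        if np.allclose(newC, C):
            break
        C = newC
    return C, a
```

The speed-up comes from running the expensive iterations on the (much smaller) prototype set; the correcting step then typically needs only a few passes over the full data to converge.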


Notes

  1. For the sake of simplicity, we assume that the patterns are from a Euclidean space and that the Euclidean distance is used; the proposed methods, however, are applicable with any distance metric.


Author information


Corresponding author

Correspondence to T. Hitendra Sarma.


Cite this article

Sarma, T.H., Viswanath, P. & Reddy, B.E. A hybrid approach to speed-up the k-means clustering method. Int. J. Mach. Learn. & Cyber. 4, 107–117 (2013). https://doi.org/10.1007/s13042-012-0079-7
