Abstract
The k-means clustering method is an iterative, partition-based method that, for finite data-sets, converges to a solution in finite time. Its running time grows linearly with the size of the data-set, and many variants have been proposed to speed it up. In this paper, we propose a prototype-based hybrid approach to speed up the k-means clustering method. The proposed method first partitions the data-set into small clusters (grouplets) of varying sizes, each represented by a prototype. The set of prototypes is then partitioned into k clusters using a modified k-means method, which is similar to the conventional k-means method but avoids empty clusters (clusters to which no pattern is assigned) during the iterative process. In each cluster of prototypes, every prototype is replaced by its corresponding set of patterns (the patterns that formed the grouplet), yielding a partition of the data-set. Since this partition can deviate from the one obtained by running the conventional k-means method over the entire data-set, a correcting step is proposed. Both theoretically and experimentally, the conventional k-means method and the proposed hybrid method (augmented with the correcting step) are shown to yield the same result, provided the initial k seed points are the same; the proposed method, however, is much faster. Experimentally, the proposed method is compared with the conventional method and with other recent methods proposed to speed up the k-means method.
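The pipeline described in the abstract can be sketched as follows. This is an illustrative outline only, not the authors' implementation: the grouplet stage is assumed here to be a leaders-style single pass with a hypothetical distance threshold `tau`, prototypes are weighted by grouplet size when clustered, and the correcting step is approximated by a single nearest-centre reassignment of every pattern.

```python
import numpy as np

def leaders(X, tau):
    """Single-pass grouplet formation (assumed leaders-style stage):
    each pattern joins the first existing leader within distance tau,
    otherwise it becomes a new leader."""
    lead_idx = []          # row indices of the leaders (prototypes)
    members = []           # members[j] = pattern indices of grouplet j
    for i, x in enumerate(X):
        for j, l in enumerate(lead_idx):
            if np.linalg.norm(x - X[l]) <= tau:
                members[j].append(i)
                break
        else:
            lead_idx.append(i)
            members.append([i])
    return lead_idx, members

def kmeans(X, seeds, n_iter=100, weights=None):
    """Lloyd-style k-means; a centre whose cluster becomes empty
    keeps its previous position (empty clusters are avoided).
    Optional weights let prototypes stand in for whole grouplets."""
    C = np.array(seeds, dtype=float)
    w = np.ones(len(X)) if weights is None else np.asarray(weights, float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        newC = C.copy()
        for k in range(len(C)):
            mask = labels == k
            if mask.any():
                newC[k] = np.average(X[mask], axis=0, weights=w[mask])
        if np.allclose(newC, C):
            break
        C = newC
    return C, labels

def hybrid_kmeans(X, k, tau, seeds):
    """Prototype-based hybrid k-means sketch: cluster the prototypes,
    expand each grouplet, then apply a one-shot correcting step."""
    lead_idx, members = leaders(X, tau)
    P = X[lead_idx]                                  # prototypes
    w = np.array([len(m) for m in members], float)   # grouplet sizes
    C, plabels = kmeans(P, seeds, weights=w)
    labels = np.empty(len(X), dtype=int)
    for j, m in enumerate(members):                  # expand grouplets
        labels[m] = plabels[j]
    # correcting step (approximated): reassign every pattern to its
    # nearest centre so the partition matches plain k-means more closely
    d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
    return C, d.argmin(axis=1)
```

On two well-separated groups of points, the sketch recovers the expected two-cluster partition; the speed-up in the paper comes from running the iterative step over the (much smaller) prototype set rather than the full data-set.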






Notes
For the sake of simplicity, we assume that the patterns are from a Euclidean space and that Euclidean distance is used, although the proposed methods are applicable with any distance metric.
Cite this article
Sarma, T.H., Viswanath, P. & Reddy, B.E. A hybrid approach to speed-up the k-means clustering method. Int. J. Mach. Learn. & Cyber. 4, 107–117 (2013). https://doi.org/10.1007/s13042-012-0079-7