Abstract
The k-means clustering method is an iterative, partition-based method that, for finite data-sets, converges to a solution in finite time. Its running time grows linearly with the size of the data-set, and many variants have been proposed to speed it up. In this paper, we propose a prototype-based hybrid approach to speed up the k-means clustering method. The proposed method first partitions the data-set into small clusters (grouplets) of varying sizes, each represented by a prototype. The set of prototypes is then partitioned into k clusters using a modified k-means method, which is similar to the conventional k-means method but avoids empty clusters (clusters to which no pattern is assigned) during the iterative process. In each cluster of prototypes, every prototype is replaced by its corresponding set of patterns (the patterns that formed the grouplet), yielding a partition of the data-set. Since this partition can deviate from the one obtained by running the conventional k-means method over the entire data-set, a correcting step is proposed. Both theoretically and experimentally, the conventional k-means method and the proposed hybrid method (augmented with the correcting step) are shown to yield the same result, provided the initial k seed points are the same; the proposed method, however, is much faster. Experimentally, the proposed method is compared with the conventional method and with other recent methods proposed to speed up the k-means method.
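The pipeline described in the abstract can be sketched as follows. This is an illustrative outline only, not the authors' implementation: the grouplet stage is assumed here to be a leaders-style single pass with a hypothetical distance threshold `tau`, prototypes are weighted by grouplet size when clustered, and the correcting step is approximated by a single nearest-centre reassignment of every pattern.

```python
import numpy as np

def leaders(X, tau):
    """Single-pass grouplet formation (assumed leaders-style stage):
    each pattern joins the first existing leader within distance tau,
    otherwise it becomes a new leader."""
    lead_idx = []          # row indices of the leaders (prototypes)
    members = []           # members[j] = pattern indices of grouplet j
    for i, x in enumerate(X):
        for j, l in enumerate(lead_idx):
            if np.linalg.norm(x - X[l]) <= tau:
                members[j].append(i)
                break
        else:
            lead_idx.append(i)
            members.append([i])
    return lead_idx, members

def kmeans(X, seeds, n_iter=100, weights=None):
    """Lloyd-style k-means; a centre whose cluster becomes empty
    keeps its previous position (empty clusters are avoided).
    Optional weights let prototypes stand in for whole grouplets."""
    C = np.array(seeds, dtype=float)
    w = np.ones(len(X)) if weights is None else np.asarray(weights, float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        newC = C.copy()
        for k in range(len(C)):
            mask = labels == k
            if mask.any():
                newC[k] = np.average(X[mask], axis=0, weights=w[mask])
        if np.allclose(newC, C):
            break
        C = newC
    return C, labels

def hybrid_kmeans(X, k, tau, seeds):
    """Prototype-based hybrid k-means sketch: cluster the prototypes,
    expand each grouplet, then apply a one-shot correcting step."""
    lead_idx, members = leaders(X, tau)
    P = X[lead_idx]                                  # prototypes
    w = np.array([len(m) for m in members], float)   # grouplet sizes
    C, plabels = kmeans(P, seeds, weights=w)
    labels = np.empty(len(X), dtype=int)
    for j, m in enumerate(members):                  # expand grouplets
        labels[m] = plabels[j]
    # correcting step (approximated): reassign every pattern to its
    # nearest centre so the partition matches plain k-means more closely
    d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
    return C, d.argmin(axis=1)
```

On two well-separated groups of points, the sketch recovers the expected two-cluster partition; the speed-up in the paper comes from running the iterative step over the (much smaller) prototype set rather than the full data-set.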






Notes
For the sake of simplicity, we assume that the patterns are from a Euclidean space and that Euclidean distance is used, although the proposed methods are applicable with any distance metric.
Cite this article
Sarma, T.H., Viswanath, P. & Reddy, B.E. A hybrid approach to speed-up the k-means clustering method. Int. J. Mach. Learn. & Cyber. 4, 107–117 (2013). https://doi.org/10.1007/s13042-012-0079-7