Abstract
Nowadays, clustering of massive datasets is a crucial part of many data-analytic tasks. Most available clustering algorithms have two shortcomings when applied to big data: (1) a large group of clustering algorithms, e.g. \(k\)-means, must keep the data in memory and iterate over it many times, which is very costly for big datasets; (2) clustering algorithms that run within limited memory, especially the family of stream-clustering algorithms, lack parallel implementations that could exploit modern multi-core processors, and the quality of their results is often poor. In this paper, we propose an algorithm that combines parallel clustering with single-pass, stream-clustering algorithms. The aim is a clustering algorithm that uses the full capabilities of a regular multi-core PC to cluster the dataset as fast as possible while producing clusters of acceptable quality. Our idea is to split the data into chunks and cluster each chunk in a separate thread. The clusters extracted from the chunks are then aggregated in a final re-clustering stage. The parameters of the algorithm can be adjusted to match hardware limitations. Experimental results on a 12-core computer show that the proposed method is much faster than its batch-processing equivalents (e.g. \(k\)-means++) and than stream-based algorithms. Moreover, the quality of its solutions is often equal to that of \(k\)-means++, while it significantly dominates stream-clustering algorithms. Our solution also scales well with additional cores and hence provides an effective and fast way to cluster large datasets on multi-core and multi-processor systems.
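The two-phase scheme described in the abstract (cluster each chunk in its own thread, then re-cluster the per-chunk centers into the final \(k\) centers) can be sketched as follows. This is an illustrative sketch only, using plain Lloyd's \(k\)-means rather than the authors' single-pass implementation (their source code is linked in the Notes); all function names here are hypothetical.

```python
# Sketch of the chunk-then-recluster scheme: each chunk is clustered in its
# own thread, and the collected per-chunk centers are re-clustered into the
# final k centers. Illustrative only, not the authors' implementation.
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def lloyd_kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns the k cluster centers."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center.
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each non-empty center as the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    return centers


def parallel_chunk_kmeans(points, k, n_chunks=4, n_threads=4):
    """Phase 1: cluster chunks in parallel. Phase 2: re-cluster the centers."""
    chunks = np.array_split(np.asarray(points, dtype=float), n_chunks)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # One task per chunk; each yields k intermediate centers.
        per_chunk = list(pool.map(lambda c: lloyd_kmeans(c, k), chunks))
    # Re-cluster the n_chunks * k intermediate centers into the final k.
    return lloyd_kmeans(np.vstack(per_chunk), k)
```

Because each chunk is processed independently, the first phase parallelizes trivially across cores, and only the small set of intermediate centers needs to be held for the final aggregation step.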
Notes
The source code can be found at https://github.com/alihadian/mpkmeans.
References
Ackermann MR, Märtens M, Raupach C, Swierkot K, Lammersen C, Sohler C (2012) StreamKM++: a clustering algorithm for data streams. J Exp Algorithm 17(1):Article 2.4
Agarwal PK, Har-Peled S, Varadarajan KR (2004) Approximating extent measures of points. J ACM 51(4):606–635
Fred ALN, Jain AK (2003) Robust data clustering. In: Proceedings of the 2003 IEEE computer society conference on computer vision and pattern recognition, vol 2. IEEE, pp II-128–II-133
Apiletti D, Baralis E, Cerquitelli T, Chiusano S, Grimaudo L (2013) SEARUM: a cloud-based service for association rule mining. In: Proceedings of the 12th IEEE international conference on trust, security and privacy in computing and communications (TrustCom). IEEE, pp 1283–1290
Arthur D, Vassilvitskii S (2007) k-means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms. Society for Industrial and Applied Mathematics, pp 1027–1035
Bahmani B, Moseley B, Vattani A, Kumar R, Vassilvitskii S (2012) Scalable k-means++. Proc VLDB Endow 5(7):622–633
Bekkerman R, Bilenko M, Langford J (2012) Scaling up machine learning: parallel and distributed approaches. Cambridge University Press, Cambridge
Borg I, Groenen PJF (2005) Modern multidimensional scaling: theory and applications. Springer, Berlin
Bottou L, Bengio Y (1995) Convergence properties of the K-means algorithm. Advances in Neural Information Processing Systems 7 (NIPS’94), pp 585–592
Braverman V, Meyerson A, Ostrovsky R, Roytman A, Shindler M, Tagiku B (2011) Streaming k-means on well-clusterable data. In: Proceedings of the 22nd annual ACM-SIAM symposium on discrete algorithms, SIAM, pp 26–40
Chandra A, Yao X (2006) Evolving hybrid ensembles of learning machines for better generalisation. Neurocomputing 69(7):686–700
Chiang MMT, Mirkin B (2010) Intelligent choice of the number of clusters in K-Means clustering: an experimental study with different cluster spreads. J Classif 27(1):3–40
Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
Dhillon IS, Modha DS (2000) A data-clustering algorithm on distributed memory multiprocessors. In: Large-scale parallel data mining. Springer, Berlin, pp 245–260
Domingos P, Hulten G (2001) A general method for scaling up machine learning algorithms and its application to clustering. In: ICML, pp 106–113
Ene A, Im S, Moseley B (2011) Fast clustering using mapreduce. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 681–689
Feldman D, Schmidt M, Sohler C (2013) Turning big data into tiny data: constant-size coresets for k-means, PCA and projective clustering. In: Proceedings of the 24th ACM-SIAM symposium on discrete algorithms (SODA13)
Gan G, Ma C, Wu J (2007) Data clustering: theory, algorithms, and applications, vol 20. Society for Industrial and Applied Mathematics
Feng Y, Hamerly G (2007) PG-means: learning the number of clusters in data. In: Advances in neural information processing systems, vol 19. The MIT Press, Cambridge, pp 393–400
Hawwash B, Nasraoui O (2012) Stream-dashboard: a framework for mining, tracking and validating clusters in a data stream. In: Proceedings of the 1st international workshop on big data, streams and heterogeneous source mining: algorithms, systems, programming models and applications, BigMine ’12. ACM, New York, pp 109–117
Jain AK (2010) Data clustering: 50 years beyond K-means. Pattern Recognit Lett 31(8):651–666
Mahafzah BA, Al-Badarneh AF, Zakaria MZ (2009) A new sampling technique for association rule mining. J Inf Sci 35(3):358–376
McLachlan GJ, Krishnan T (2007) The EM algorithm and extensions, vol 382. Wiley-Interscience, New York
O’Callaghan L (2003) Approximation algorithms for clustering streams and large data sets. PhD thesis, Stanford University
Pelleg D, Moore A (2000) X-means: extending k-means with efficient estimation of the number of clusters. In: Proceedings of the seventeenth international conference on machine learning, vol 1. San Francisco, pp 727–734
Rajaraman A, Ullman JD (2012) Mining of massive datasets. Cambridge University Press, Cambridge
Sculley D (2010) Web-scale k-means clustering. In: Proceedings of the 19th international conference on World wide web. ACM, pp 1177–1178
Shindler M, Wong A, Meyerson AW (2011) Fast and accurate k-means for large datasets. In: Advances in neural information processing systems, pp 2375–2383
Tibshirani R, Walther G, Hastie T (2001) Estimating the number of clusters in a data set via the gap statistic. J Royal Stat Soc Series B (Stat Methodol) 63(2):411–423
Wang L, Leckie C, Ramamohanarao K, Bezdek J (2009) Automatically determining the number of clusters in unlabeled data sets. IEEE Trans Knowl Data Eng 21(3):335–350
Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Yu PS et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37
Xu R, Wunsch D (2008) Clustering, vol 10. Wiley-IEEE Press, New York
Yan M, Ye K (2007) Determining the number of clusters using the weighted gap statistic. Biometrics 63(4):1031–1037
Zhao W, Ma H, He Q (2009) Parallel k-means clustering based on mapreduce. In: Cloud computing. Springer, Berlin, pp 674–679
Acknowledgments
The authors would like to thank Ali Gholami Rudi for his constructive comments and discussions.
Cite this article
Hadian, A., Shahrivari, S. High performance parallel \(k\)-means clustering for disk-resident datasets on multi-core CPUs. J Supercomput 69, 845–863 (2014). https://doi.org/10.1007/s11227-014-1185-y