
High performance parallel \(k\)-means clustering for disk-resident datasets on multi-core CPUs

Published in The Journal of Supercomputing.

Abstract

Clustering massive datasets is a crucial part of many data-analytic tasks. Most available clustering algorithms suffer from two shortcomings when applied to big data: (1) a large group of algorithms, e.g. \(k\)-means, must keep the entire dataset in memory and iterate over it many times, which is very costly for big datasets; (2) algorithms that run within a limited memory budget, especially the family of stream-clustering algorithms, lack parallel implementations that exploit modern multi-core processors, and the quality of their results is often poor. In this paper, we propose an algorithm that combines parallel clustering with single-pass stream clustering. The aim is a clustering algorithm that exploits the full capabilities of a regular multi-core PC to cluster a dataset as fast as possible while producing clusters of acceptable quality. Our idea is to split the data into chunks and cluster each chunk in a separate thread; the clusters extracted from the chunks are then aggregated in a final re-clustering stage. The parameters of the algorithm can be tuned to the available hardware. Experimental results on a 12-core computer show that the proposed method is much faster than its batch-processing counterparts (e.g. \(k\)-means++) and than stream-based algorithms. The quality of its solutions is usually on par with \(k\)-means++ and significantly better than that of stream-clustering algorithms. The method also scales well with additional cores, providing an effective and fast solution for clustering large datasets on multi-core and multi-processor systems.
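The chunk-and-merge scheme the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: all names (`weighted_kmeans`, `parallel_chunk_kmeans`, the farthest-point seeding) are assumptions introduced here, and a real version would stream chunks from disk rather than hold them in lists.

```python
# Sketch of the two-phase idea: cluster each chunk concurrently, then
# re-cluster the per-chunk centroids, weighting each by its cluster size.
from concurrent.futures import ThreadPoolExecutor

def _d2(p, q):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def _farthest_point_init(points, k):
    """Deterministic farthest-point seeding (a k-means++-style heuristic)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(_d2(p, c) for c in centroids)))
    return centroids

def weighted_kmeans(points, weights, k, iters=20):
    """Lloyd's k-means over weighted points; returns (centroids, cluster masses)."""
    centroids = _farthest_point_init(points, k)
    mass = [0.0] * k
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        mass = [0.0] * k
        for p, w in zip(points, weights):
            j = min(range(k), key=lambda c: _d2(p, centroids[c]))
            buckets[j].append((p, w))
            mass[j] += w
        for j, bucket in enumerate(buckets):
            if bucket and mass[j] > 0:
                dim = len(bucket[0][0])
                centroids[j] = tuple(
                    sum(p[d] * w for p, w in bucket) / mass[j] for d in range(dim)
                )
    return centroids, mass

def parallel_chunk_kmeans(chunks, k, workers=4):
    """Phase 1: cluster each chunk in its own worker thread.
    Phase 2: re-cluster the union of chunk centroids, each weighted by
    the number of points it represents."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda ch: weighted_kmeans(ch, [1.0] * len(ch), k), chunks))
    cents = [c for cs, ms in results for c in cs]
    masses = [m for cs, ms in results for m in ms]
    final, _ = weighted_kmeans(cents, masses, k)
    return final
```

Note that in CPython the interpreter lock limits the speedup of threads on pure-Python arithmetic; the paper's native multi-core implementation does not have this constraint, and in practice one would use processes or vectorized array operations for the per-chunk step.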


Notes

  1. The source code can be found at https://github.com/alihadian/mpkmeans.

  2. http://kdd.ics.uci.edu/databases/kddcup99/.

  3. http://archive.ics.uci.edu/ml/datasets/Covertype.

  4. http://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption.

  5. http://archive.ics.uci.edu/ml/datasets/US+Census+Data+(1990).

  6. http://opencv.org/.


Acknowledgments

The authors would like to thank Ali Gholami Rudi for his constructive comments and discussions.

Author information

Corresponding author: Ali Hadian.


Cite this article

Hadian, A., Shahrivari, S. High performance parallel \(k\)-means clustering for disk-resident datasets on multi-core CPUs. J Supercomput 69, 845–863 (2014). https://doi.org/10.1007/s11227-014-1185-y
