A MapReduce-based K-means clustering algorithm

The Journal of Supercomputing

Abstract

Partition-based k-means is one of the most important clustering algorithms. In big data environments, however, it suffers from several problems, including random selection of the initial cluster centers, expensive communication overhead among MapReduce nodes, and data skew across data partitions. To address these problems, this paper proposes MR-PGDLSH, a parallel clustering algorithm based on grid density and a locality-sensitive hash function (LSH), which combines the advantages of MapReduce and LSH. In MR-PGDLSH, a grid density strategy (GDS) is first designed to obtain relatively reasonable initial cluster centers. Then, a data partition strategy based on LSH (DP-LSH) is proposed to divide the data set into multiple segments, so that related data objects are mapped to the same sub-data set. A similarity function is designed to generate clusters, which reduces the frequent communication between nodes. Next, an adaptive grouping strategy (AGS) is applied to distribute the data evenly across nodes, which addresses data skew on the nodes. Finally, the cluster centers are mined in parallel to obtain the final clustering results. Both theoretical analysis and experimental results show that MR-PGDLSH outperforms the existing clustering algorithms it is compared with.
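The abstract does not give the details of DP-LSH or of the parallel center update, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a single Gaussian (p-stable) hash of the form floor((a·v + b) / width) for partitioning and a plain k-means map/reduce step; GDS and AGS are omitted, and all function names and the `width` parameter are hypothetical.

```python
import math
import random
from collections import defaultdict


def make_lsh(dim, width, seed=0):
    """Build one Gaussian LSH function h(v) = floor((a.v + b) / width) (assumed form)."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, width)

    def h(v):
        return math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / width)

    return h


def partition(points, h):
    """Group points by hash bucket, so nearby points tend to share a partition."""
    buckets = defaultdict(list)
    for p in points:
        buckets[h(p)].append(p)
    return list(buckets.values())


def kmeans_map(part, centers):
    """Map step: per partition, emit (center index, (coordinate sums, count))."""
    acc = {}
    for p in part:
        j = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        sums, n = acc.get(j, ([0.0] * len(p), 0))
        acc[j] = ([s + x for s, x in zip(sums, p)], n + 1)
    return list(acc.items())


def kmeans_reduce(mapped, centers):
    """Reduce step: merge partial sums per center and return the updated centers."""
    total = {}
    for j, (sums, n) in mapped:
        t_sums, t_n = total.get(j, ([0.0] * len(sums), 0))
        total[j] = ([t + s for t, s in zip(t_sums, sums)], t_n + n)
    # A center with no assigned points keeps its previous position.
    return [
        [s / total[j][1] for s in total[j][0]] if j in total else list(c)
        for j, c in enumerate(centers)
    ]


def kmeans_iteration(points, centers, h):
    """One full iteration: partition with LSH, map each partition, then reduce."""
    mapped = [kv for part in partition(points, h) for kv in kmeans_map(part, centers)]
    return kmeans_reduce(mapped, centers)
```

Because each map call emits only per-center sums and counts rather than individual points, the data passed to the reduce step grows with the number of partitions and centers, not with the number of points. This is one common way to limit communication between nodes in a MapReduce k-means.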


Funding

This study was supported by the National Natural Science Foundation of China (41562019, 61762046) and the National Key Research and Development Program of China (2018YFC1504705).

Author information

Corresponding author

Correspondence to XueYu Huang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Mao, Y., Gan, D., Mwakapesa, D.S. et al. A MapReduce-based K-means clustering algorithm. J Supercomput 78, 5181–5202 (2022). https://doi.org/10.1007/s11227-021-04078-8

  • DOI: https://doi.org/10.1007/s11227-021-04078-8
