Abstract
Partitioning-based k-means is one of the most important clustering algorithms. In big data environments, however, it suffers from several problems, including random selection of the initial cluster centers, expensive communication overhead among MapReduce nodes, and data skew across data partitions. To address these problems, this paper proposes MR-PGDLSH, a parallel clustering algorithm based on grid density and locality-sensitive hashing (LSH) that combines the advantages of MapReduce and LSH. In MR-PGDLSH, a grid density strategy (GDS) is first designed to obtain relatively reasonable initial cluster centers. Then, a data partition strategy based on locality-sensitive hashing (DP-LSH) is proposed to divide the data set into multiple segments so that related data objects are mapped to the same sub-data set, and a similarity function is designed to generate clusters, thereby reducing frequent communication between nodes. Next, an adaptive grouping strategy (AGS) is applied to distribute the data evenly across nodes, which resolves data skew on the nodes. Finally, the cluster centers are mined in parallel to obtain the final clustering results. Both theoretical analysis and experimental results show that MR-PGDLSH outperforms existing clustering algorithms.
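The abstract names three building blocks (grid-density seeding, LSH-based partitioning, and balanced grouping) without giving their definitions, so the following is only a minimal single-machine sketch of the general ideas, not the authors' method. It assumes the densest grid cells supply the initial centers, uses the standard p-stable LSH form h(x) = floor((a·x + b) / w), and substitutes a simple greedy largest-first rule for the adaptive grouping strategy. All function names and the parameters `cells_per_dim`, `bucket_width`, and `n_nodes` are hypothetical.

```python
import math
import random
from collections import defaultdict


def grid_density_centers(points, k, cells_per_dim=4):
    """Return centroids of the k densest grid cells (assumed seeding rule).

    May return fewer than k centers if fewer than k cells are non-empty.
    """
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]
    # Cell width per dimension; fall back to 1.0 for zero-width dimensions.
    widths = [((highs[d] - lows[d]) / cells_per_dim) or 1.0 for d in range(dims)]

    cells = defaultdict(list)
    for p in points:
        key = tuple(
            min(int((p[d] - lows[d]) / widths[d]), cells_per_dim - 1)
            for d in range(dims)
        )
        cells[key].append(p)

    densest = sorted(cells.values(), key=len, reverse=True)[:k]
    return [
        tuple(sum(p[d] for p in cell) / len(cell) for d in range(dims))
        for cell in densest
    ]


def make_lsh(dims, bucket_width, seed=0):
    """Build one p-stable (Gaussian) hash: h(x) = floor((a.x + b) / w)."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dims)]
    b = rng.uniform(0.0, bucket_width)

    def h(p):
        dot = sum(ai * xi for ai, xi in zip(a, p))
        return math.floor((dot + b) / bucket_width)

    return h


def partition_by_lsh(points, bucket_width, seed=0):
    """Group points by hash value so nearby points tend to share a bucket."""
    h = make_lsh(len(points[0]), bucket_width, seed)
    parts = defaultdict(list)
    for p in points:
        parts[h(p)].append(p)
    return dict(parts)


def balance_groups(parts, n_nodes):
    """Greedy stand-in for load balancing: largest bucket to lightest node."""
    loads = [0] * n_nodes
    groups = [[] for _ in range(n_nodes)]
    for _, pts in sorted(parts.items(), key=lambda kv: len(kv[1]), reverse=True):
        i = loads.index(min(loads))
        groups[i].extend(pts)
        loads[i] += len(pts)
    return groups
```

In a MapReduce setting, the hash value would plausibly serve as the map output key, so that each reducer receives buckets of mutually similar points; that mapping is an assumption here as well.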






Funding
This study was supported by the National Natural Science Foundation of China (41562019, 61762046) and the National Key Research and Development Program of China (2018YFC1504705).
Cite this article
Mao, Y., Gan, D., Mwakapesa, D.S. et al. A MapReduce-based K-means clustering algorithm. J Supercomput 78, 5181–5202 (2022). https://doi.org/10.1007/s11227-021-04078-8
DOI: https://doi.org/10.1007/s11227-021-04078-8