A MapReduce-based K-means clustering algorithm

Mao, YiMin; Gan, DeJin; Mwakapesa, D. S.; Nanehkaran, Y. A.; Tao, Tao; Huang, XueYu

doi:10.1007/s11227-021-04078-8


Published: 20 September 2021

Volume 78, pages 5181–5202, (2022)

The Journal of Supercomputing

YiMin Mao¹,
DeJin Gan¹,
D. S. Mwakapesa¹,
Y. A. Nanehkaran¹,
Tao Tao¹ &
XueYu Huang¹


Abstract

Partitioning-based k-means is one of the most important clustering algorithms. In big data environments, however, it suffers from several problems, including random selection of the initial cluster centers, expensive communication overhead among MapReduce nodes, and data skew across data partitions. To solve these problems, this paper proposes MR-PGDLSH, a parallel clustering algorithm based on grid density and a locality-sensitive hash (LSH) function that takes advantage of both MapReduce and LSH. In MR-PGDLSH, a grid density strategy (GDS) is first designed to obtain reasonable initial cluster centers. Then, a data partitioning strategy based on the locality-sensitive hash function (DP-LSH) is proposed to divide the data set into multiple segments so that related data objects are mapped to the same sub-data set, and a similarity function is designed to generate clusters, thereby reducing the communication overhead between nodes. Next, an adaptive grouping strategy (AGS) is applied to distribute the data evenly across nodes, which addresses the problem of data skew. Finally, MR-PGDLSH mines the cluster centers in parallel to obtain the final clustering results. Both theoretical analysis and experimental results show that MR-PGDLSH is superior to the existing clustering algorithms.
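The abstract does not give the details of DP-LSH or of the parallel center update, so the following is only a minimal single-process sketch of the two general ideas it names, not the authors' MR-PGDLSH. It assumes a standard p-stable (Gaussian) hash of the form h(v) = floor((a·v + b) / w) for partitioning, and the usual map/reduce formulation of one k-means iteration. All function names, the bucket width, the seeds, and the sample points are hypothetical choices made for illustration.

```python
import math
import random
from collections import defaultdict


def make_lsh(dim, width, seed):
    """Return h(v) = floor((a . v + b) / width) with Gaussian a (assumed p-stable LSH)."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, width)

    def h(v):
        dot = sum(ai * vi for ai, vi in zip(a, v))
        return math.floor((dot + b) / width)

    return h


def lsh_partition(points, hashes):
    """Group points whose hash signatures match into the same sub-data set."""
    buckets = defaultdict(list)
    for p in points:
        buckets[tuple(h(p) for h in hashes)].append(p)
    return dict(buckets)


def kmeans_map(point, centers):
    """Map step: emit (index of the nearest center, (point, 1))."""
    nearest = min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))
    return nearest, (point, 1)


def kmeans_reduce(pairs, centers):
    """Reduce step: average the points assigned to each center index."""
    sums = {}
    for idx, (point, count) in pairs:
        total, n = sums.get(idx, ([0.0] * len(point), 0))
        sums[idx] = ([t + x for t, x in zip(total, point)], n + count)
    # A center that received no points keeps its previous position.
    return [
        tuple(s / sums[i][1] for s in sums[i][0]) if i in sums else tuple(centers[i])
        for i in range(len(centers))
    ]


# Hypothetical toy data: two well-separated groups of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centers = [(0.0, 0.0), (10.0, 10.0)]

hashes = [make_lsh(dim=2, width=4.0, seed=s) for s in range(3)]
buckets = lsh_partition(points, hashes)

new_centers = kmeans_reduce([kmeans_map(p, centers) for p in points], centers)
```

In a real MapReduce job the map and reduce functions would run on separate nodes and the bucket key would decide which node receives each point. Here everything runs in one process so that the logic can be checked by hand.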


Funding

This study was supported by the National Natural Science Foundation of China (41562019, 61762046) and the National Key Research and Development Program of China (2018YFC1504705).

Author information

Authors and Affiliations

School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou, 341000, Jiangxi, China
YiMin Mao, DeJin Gan, D. S. Mwakapesa, Y. A. Nanehkaran, Tao Tao & XueYu Huang

Corresponding author

Correspondence to XueYu Huang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Mao, Y., Gan, D., Mwakapesa, D.S. et al. A MapReduce-based K-means clustering algorithm. J Supercomput 78, 5181–5202 (2022). https://doi.org/10.1007/s11227-021-04078-8

Accepted: 07 September 2021
Published: 20 September 2021
Issue Date: March 2022
DOI: https://doi.org/10.1007/s11227-021-04078-8
