Efficient data distribution and results merging for parallel data clustering in mapreduce environment

Bousbaci, Abdelhak; Kamel, Nadjet

doi:10.1007/s10489-017-1089-7

Published: 25 November 2017

Applied Intelligence, Volume 48, pages 2408–2428 (2018)

Abstract

Clustering consists of partitioning data into clusters such that data in the same cluster are strongly similar and data in different clusters are weakly similar. With the significant increase in data volume, the clustering process becomes computationally expensive. Several solutions have therefore been proposed to overcome this issue using parallelism with the MapReduce paradigm. The solutions proposed in the literature aim to optimize the execution time while keeping the clustering quality close or identical to that of the sequential execution. One of the commonly used parallel clustering strategies with the MapReduce framework consists of partitioning the data and processing each partition separately. The results obtained from each partition are then merged to obtain the final cluster configuration. Using a random data distribution strategy and an inappropriate merging technique leads to inaccurate final centroids and a rather average clustering quality. Hence, in this paper we propose a parallel scheme for partitional clustering algorithms based on MapReduce, with non-conventional data distribution and results merging strategies, to improve the clustering quality. With this solution, in addition to optimizing the execution time, we exploit the parallel environment to enhance the clustering quality. The experimental results demonstrate the effectiveness and scalability of our solution in comparison with other recently proposed works. We also propose an application of our approach to the community detection problem. The results demonstrate the ability of our approach to provide effective and relevant results.
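
The generic partition, cluster, and merge strategy described in the abstract can be illustrated with a minimal sketch. This is not the authors' scheme: it uses the random data distribution that the abstract names as the baseline, plain k-means as the partitional algorithm, and a size-weighted k-means over the local centroids as one possible merging technique. All function names (`map_phase`, `reduce_phase`, `parallel_cluster`) are hypothetical, and a sequential loop stands in for the MapReduce mappers.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def weighted_mean(points, weights):
    """Weighted mean of points; the weights must not sum to zero."""
    total = sum(weights)
    dims = range(len(points[0]))
    return tuple(sum(p[d] * w for p, w in zip(points, weights)) / total
                 for d in dims)

def weighted_kmeans(points, weights, seeds, iters=25):
    """Lloyd's iterations on weighted points, starting from the given seeds.
    Returns (centroids, total weight assigned to each centroid)."""
    centroids = list(seeds)
    k = len(centroids)
    for _ in range(iters):
        members = [[] for _ in range(k)]
        for idx, p in enumerate(points):
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            members[nearest].append(idx)
        # An empty cluster keeps its previous centroid.
        centroids = [
            weighted_mean([points[i] for i in m], [weights[i] for i in m])
            if m else centroids[c]
            for c, m in enumerate(members)
        ]
    cluster_weights = [sum(weights[i] for i in m) for m in members]
    return centroids, cluster_weights

def map_phase(partition, k, rng):
    """Cluster one partition independently (what a single mapper would do)."""
    seeds = rng.sample(partition, k)
    return weighted_kmeans(partition, [1] * len(partition), seeds)

def reduce_phase(local_results, k):
    """Merge local results: cluster the local centroids, each weighted by
    the number of points it represents (an assumed merging technique)."""
    pooled = [(c, w) for cents, sizes in local_results
              for c, w in zip(cents, sizes) if w > 0]
    points = [c for c, _ in pooled]
    weights = [w for _, w in pooled]
    # Assumption: seed the merge with the first k non-empty local centroids.
    return weighted_kmeans(points, weights, points[:k])

def parallel_cluster(points, k, n_partitions, seed=0):
    """Random data distribution, independent clustering, then merging."""
    rng = random.Random(seed)
    shuffled = list(points)
    rng.shuffle(shuffled)
    partitions = [shuffled[i::n_partitions] for i in range(n_partitions)]
    local_results = [map_phase(p, k, rng) for p in partitions]
    return reduce_phase(local_results, k)
```

Calling `parallel_cluster(points, k=2, n_partitions=4)` on a list of coordinate tuples returns the merged centroids together with the number of original points each one represents. Weighting the local centroids by cluster size keeps a small local cluster from pulling a merged centroid as strongly as a large one, which is one way a poorly chosen merging step could otherwise distort the final centroids.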

Author information

Authors and Affiliations

LRIA, Computer Science Department, USTHB, BP 32 El Alia 16111 Bab Ezzouar, Algiers, Algeria
Abdelhak Bousbaci
Computer Science Department, Faculty of Sciences, UFAS, Ferhat Abbas Setif University 1, Campus El Bez, Setif, 1900, Algeria
Nadjet Kamel

Corresponding author

Correspondence to Abdelhak Bousbaci.

About this article

Cite this article

Bousbaci, A., Kamel, N. Efficient data distribution and results merging for parallel data clustering in mapreduce environment. Appl Intell 48, 2408–2428 (2018). https://doi.org/10.1007/s10489-017-1089-7

Issue Date: August 2018