Abstract
Data clustering partitions a dataset into clusters such that items in the same cluster are strongly similar and items in different clusters are weakly similar. As data volumes grow, clustering becomes computationally expensive. Several solutions have therefore been proposed to overcome this issue through parallelism with the MapReduce paradigm. The solutions proposed in the literature aim to reduce the execution time while keeping the clustering quality close or identical to that of the sequential execution. One commonly used parallel clustering strategy with the MapReduce framework consists in partitioning the data and processing each partition separately. The results obtained from each partition are then merged to obtain the final cluster configuration. A random data distribution strategy combined with an inappropriate merging technique leads to inaccurate final centroids and mediocre clustering quality. Hence, in this paper we propose a parallel scheme for partitional clustering algorithms based on MapReduce, with non-conventional data distribution and results merging strategies, to improve the clustering quality. With this solution, in addition to reducing the execution time, we exploit the parallel environment to enhance the clustering quality. The experimental results demonstrate the effectiveness and scalability of our solution in comparison with other recently proposed works. We also propose an application of our approach to the community detection problem. The results demonstrate the ability of our approach to provide effective and relevant results.
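The conventional partition-then-merge strategy described above can be sketched in a few lines of single-process Python. This is a minimal illustration of the baseline only, not the scheme proposed in the paper, whose data distribution and merging strategies are not detailed in the abstract. All function names are hypothetical. The farthest-first initialization and the merge step (re-clustering the pooled local centroids) are assumptions chosen to keep the example short and deterministic, and the map and reduce phases are ordinary function calls rather than MapReduce jobs.

```python
import random
from math import dist


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def farthest_first_init(points, k):
    """Assumed initialization: start from the first point, then repeatedly
    add the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iters=20):
    """Lloyd's k-means iterations; returns a list of k centroids."""
    centroids = farthest_first_init(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            groups[nearest].append(p)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids


def random_partition(points, n_parts, seed=0):
    """Random data distribution: shuffle, then deal round-robin."""
    shuffled = points[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_parts] for i in range(n_parts)]


def map_phase(partitions, k):
    """Cluster each partition independently (one mapper per partition)."""
    return [kmeans(part, k) for part in partitions]


def reduce_phase(local_centroids, k):
    """Assumed merge: cluster the pooled local centroids into k groups."""
    pooled = [c for centroids in local_centroids for c in centroids]
    return kmeans(pooled, k)


def partitioned_kmeans(points, k, n_parts):
    """Partition the data, cluster each part, and merge the results."""
    partitions = random_partition(points, n_parts)
    return reduce_phase(map_phase(partitions, k), k)
```

Calling `partitioned_kmeans(points, k=2, n_parts=4)` on a list of coordinate tuples returns two merged centroids. Because each partition sees only a random subset of the data and the merge step treats every local centroid as equally reliable, the final centroids can drift from those of a sequential run, which is the weakness the abstract attributes to this baseline.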
Cite this article
Bousbaci, A., Kamel, N. Efficient data distribution and results merging for parallel data clustering in mapreduce environment. Appl Intell 48, 2408–2428 (2018). https://doi.org/10.1007/s10489-017-1089-7