Efficient data distribution and results merging for parallel data clustering in mapreduce environment

Applied Intelligence

Abstract

Clustering partitions data into groups such that items in the same cluster are strongly similar and items in different clusters are weakly similar. As data volumes grow, clustering becomes computationally expensive. Several solutions have therefore been proposed that overcome this cost through parallelism with the MapReduce paradigm. These solutions aim to reduce execution time while keeping clustering quality close or identical to that of sequential execution. A commonly used parallel clustering strategy in the MapReduce framework partitions the data and processes each partition separately; the results obtained from each partition are then merged to obtain the final cluster configuration. A random data distribution strategy combined with an inappropriate merging technique leads to inaccurate final centroids and only average clustering quality. Hence, in this paper we propose a parallel scheme for partitional clustering algorithms based on MapReduce, with non-conventional data distribution and results merging strategies, to improve clustering quality. With this solution, in addition to reducing execution time, we exploit the parallel environment to enhance clustering quality. The experimental results demonstrate the effectiveness and scalability of our solution in comparison with other recently proposed works. We also propose an application of our approach to the community detection problem, where the results demonstrate its ability to provide effective and relevant results.
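
The generic partition, process, and merge pattern described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the scheme proposed in the paper: the abstract does not specify the authors' distribution or merging strategies, so the sketch uses the conventional random distribution and a simple merge that re-clusters the local centroids weighted by cluster size. All function names (`kmeans`, `partition_randomly`, `parallel_kmeans_sketch`) are hypothetical, and the "map" and "reduce" phases are simulated with an ordinary loop rather than a real MapReduce runtime.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, weights=None, iters=20, seed=0):
    """Weighted Lloyd's k-means. Returns (centroids, per-cluster weights),
    where the weights are those of the last assignment step."""
    rng = random.Random(seed)
    if weights is None:
        weights = [1.0] * len(points)
    centroids = [tuple(p) for p in rng.sample(points, k)]
    totals = [0.0] * k
    for _ in range(iters):
        sums = [[0.0] * len(points[0]) for _ in range(k)]
        totals = [0.0] * k
        for p, w in zip(points, weights):
            # Assign each point to its nearest current centroid.
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            totals[j] += w
            for d, x in enumerate(p):
                sums[j][d] += w * x
        # Move each centroid to the weighted mean of its points;
        # an empty cluster keeps its previous centroid.
        centroids = [
            tuple(s / totals[j] for s in sums[j]) if totals[j] > 0 else centroids[j]
            for j in range(k)
        ]
    return centroids, totals


def partition_randomly(points, n_parts, seed=0):
    """Assumed baseline: shuffle, then deal points round-robin into partitions."""
    rng = random.Random(seed)
    shuffled = list(points)
    rng.shuffle(shuffled)
    return [shuffled[i::n_parts] for i in range(n_parts)]


def parallel_kmeans_sketch(points, k, n_parts, seed=0):
    """Simulated map phase (local k-means per partition) followed by a
    simulated reduce phase (weighted k-means over the pooled local centroids)."""
    pooled, pooled_weights = [], []
    for i, part in enumerate(partition_randomly(points, n_parts, seed)):
        local_centroids, sizes = kmeans(part, k, seed=seed + i)
        for c, s in zip(local_centroids, sizes):
            if s > 0:  # ignore clusters that ended up empty
                pooled.append(c)
                pooled_weights.append(s)
    final, _ = kmeans(pooled, k, weights=pooled_weights, seed=seed)
    return final
```

Weighting each local centroid by its cluster size keeps a small local cluster from pulling a merged centroid as strongly as a large one. The abstract argues that a random distribution combined with an unsuitable merge degrades the final centroids, which is the step the paper's non-conventional strategies are meant to improve.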

Author information

Corresponding author

Correspondence to Abdelhak Bousbaci.

About this article

Cite this article

Bousbaci, A., Kamel, N. Efficient data distribution and results merging for parallel data clustering in mapreduce environment. Appl Intell 48, 2408–2428 (2018). https://doi.org/10.1007/s10489-017-1089-7

  • DOI: https://doi.org/10.1007/s10489-017-1089-7
