Efficient Results Merging for Parallel Data Clustering Using MapReduce

  • Conference paper
  • In: Distributed Computing and Artificial Intelligence, 13th International Conference

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 474)

Abstract

Data clustering partitions data into sub-groups using a distance measure. Clustering a large amount of data requires considerable execution time. Several works have been proposed to overcome this problem using parallelism. One parallel technique consists in partitioning the data and processing each partition separately; the results obtained from each partition are then merged to obtain the final cluster configuration. Using an inappropriate merging technique leads to inaccurate final centroids and mediocre clustering quality. In this paper, we propose two merging techniques to improve the clustering quality.

In the first solution, the results are merged using the K-means algorithm, and in the second using a genetic algorithm. The results demonstrate the efficiency of the proposed strategies.
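
The partition-then-merge idea with a K-means merge step can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the farthest-point initialization, the fixed iteration count, and the unweighted pooling of local centroids are all assumptions made for illustration, and the partitions are plain in-memory lists rather than MapReduce tasks.

```python
from typing import List, Sequence

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points: List[Point]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def farthest_point_init(points: List[Point], k: int) -> List[List[float]]:
    """Deterministic initialization (an assumption of this sketch):
    start from the first point, then repeatedly add the point that is
    farthest from the centroids chosen so far."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def kmeans(points: List[Point], k: int, iterations: int = 20) -> List[List[float]]:
    """Plain Lloyd-style K-means; returns k centroids."""
    centroids = farthest_point_init(points, k)
    for _ in range(iterations):
        groups: List[List[Point]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            groups[nearest].append(p)
        # An empty group keeps its previous centroid.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids


def merge_with_kmeans(partitions: List[List[Point]], k: int) -> List[List[float]]:
    """Cluster each partition separately, then cluster the pooled
    local centroids to obtain k final centroids."""
    # "Map"-like step: each partition is processed independently.
    local_centroids = [c for part in partitions for c in kmeans(part, k)]
    # "Reduce"-like step: the local centroids are themselves clustered.
    return kmeans(local_centroids, k)
```

Calling `merge_with_kmeans(partitions, k)` on a list of point lists returns `k` final centroids. Pooling the local centroids without weights treats a centroid that summarizes a handful of points the same as one that summarizes thousands; weighting each local centroid by its cluster size would be a natural refinement, but whether the paper does so cannot be determined from the abstract.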

Author information

Corresponding author

Correspondence to Abdelhak Bousbaci.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Bousbaci, A., Kamel, N. (2016). Efficient Results Merging for Parallel Data Clustering Using MapReduce. In: Omatu, S., et al. Distributed Computing and Artificial Intelligence, 13th International Conference. Advances in Intelligent Systems and Computing, vol 474. Springer, Cham. https://doi.org/10.1007/978-3-319-40162-1_38

  • DOI: https://doi.org/10.1007/978-3-319-40162-1_38

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-40161-4

  • Online ISBN: 978-3-319-40162-1

  • eBook Packages: Engineering, Engineering (R0)
