Efficient Results Merging for Parallel Data Clustering Using MapReduce

Bousbaci, Abdelhak; Kamel, Nadjet

doi:10.1007/978-3-319-40162-1_38

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 474)

Abstract

Data clustering partitions data into sub-groups using a distance measure. Clustering a large amount of data requires considerable execution time. Several works have been proposed to overcome this problem using parallelism. One parallel technique consists of partitioning the data and processing each partition separately; the results obtained from each partition are then merged to obtain the final cluster configuration. Using an inappropriate merging technique leads to inaccurate final centroids and mediocre clustering quality. In this paper, we propose two merging techniques to improve the clustering quality.

In the first solution, the results are merged using the K-means algorithm, and in the second, using a genetic algorithm. The results demonstrate the efficiency of the proposed strategies.
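The partition-then-merge idea described in the abstract can be illustrated with a minimal sketch. It is based on the abstract alone and is not the authors' implementation: the MapReduce job is simulated with an ordinary loop, the function names (`farthest_first`, `kmeans`, `cluster_and_merge`) are hypothetical, and the farthest-first initialisation is an assumption chosen only to keep the example deterministic. Only the K-means merging variant is shown; the genetic-algorithm variant is not sketched.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first(points, k):
    """Assumed initialisation: start from the first point, then repeatedly
    add the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iterations=20):
    """Plain Lloyd's K-means on a list of tuples; returns k centroids."""
    centroids = farthest_first(points, k)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group
        # (an empty group keeps its previous centroid).
        centroids = [
            tuple(sum(axis) / len(group) for axis in zip(*group)) if group else centroids[i]
            for i, group in enumerate(groups)
        ]
    return centroids


def cluster_and_merge(points, k, n_parts):
    """Cluster each partition separately, then merge the partial centroids
    by running K-means on the centroids themselves."""
    # "Map" phase (simulated): strided split, one K-means run per partition.
    partial_centroids = []
    for i in range(n_parts):
        partial_centroids.extend(kmeans(points[i::n_parts], k))
    # "Merge" phase: the k * n_parts partial centroids are the new data set.
    return kmeans(partial_centroids, k)
```

For example, calling `cluster_and_merge(data, k=2, n_parts=4)` on a list of 2-D tuples drawn from two well-separated groups returns two centroids, one per group. Note that this sketch treats every partial centroid equally in the merge step; whether the paper weights them by cluster size is not stated in the abstract.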

Author information

Authors and Affiliations

LRIA, Computer Science Department, USTHB Algiers, Bab Ezzouar, Algeria
Abdelhak Bousbaci & Nadjet Kamel
Computer Science Department, Faculty of Sciences, UFAS Setif, Setif, Algeria
Nadjet Kamel

Corresponding author

Correspondence to Abdelhak Bousbaci.

Editor information

Editors and Affiliations

Faculty of Engineering, Osaka Institute of Technology, Osaka, Japan
Sigeru Omatu
Faculty of Computer Science & Informatio, Universiti Teknologi Malaysia (UTM), Baharu, Malaysia
Ali Semalat
Department of Electronics and Compu, Koszalin University of Technology, Koszalin, Poland
Grzegorz Bocewicz
Faculty of Electrical Engineering and Cs, Kielce University of Technology, Kielce, Poland
Paweł Sitek
Faculty of Engineering and Science, Aalborg University, Aalborg, Denmark
Izabela E. Nielsen
ETS Ingeniería Informática, University of Sevilla, Sevilla, Spain
Julián A. García García
Departamento de Inteligencia Artificial, Universidad Politécnica de Madrid, Madrid, Spain
Javier Bajo

About this paper

Cite this paper

Bousbaci, A., Kamel, N. (2016). Efficient Results Merging for Parallel Data Clustering Using MapReduce. In: Omatu, S., et al. Distributed Computing and Artificial Intelligence, 13th International Conference. Advances in Intelligent Systems and Computing, vol 474. Springer, Cham. https://doi.org/10.1007/978-3-319-40162-1_38

DOI: https://doi.org/10.1007/978-3-319-40162-1_38
Published: 01 June 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-40161-4
Online ISBN: 978-3-319-40162-1
eBook Packages: Engineering, Engineering (R0)
