Scalable Random Sampling K-Prototypes Using Spark

Ben HajKacem, Mohamed Aymen; Ben N’cir, Chiheb-Eddine; Essoussi, Nadia

doi:10.1007/978-3-319-98539-8_24

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11031)

Included in the following conference series:

International Conference on Big Data Analytics and Knowledge Discovery

Abstract

Clustering Big data has become an important challenge in machine learning, and several Big data frameworks have been developed to scale clustering methods to large data sets. One such framework, Spark, is well suited to iterative algorithms because it supports in-memory computation. In this paper we propose a new method, Scalable Random Sampling K-Prototypes, implemented on the Spark framework. The method clusters large-scale mixed data, that is, data with both numeric and categorical attributes. Experiments conducted on simulated and real data sets show the efficiency of the proposed method compared to existing k-prototypes methods.
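
For readers unfamiliar with k-prototypes, the following is a minimal sketch of the standard mixed dissimilarity that k-prototypes methods rely on: squared Euclidean distance on the numeric attributes plus a weighted count of mismatches on the categorical attributes. It is not the authors' Spark implementation, and it does not show their random sampling step, which the abstract does not describe. The names `mixed_distance`, `assign`, and the weight `gamma` are assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple

# A record is a pair: (numeric attributes, categorical attributes).
Record = Tuple[Sequence[float], Sequence[str]]


def mixed_distance(x: Record, prototype: Record, gamma: float = 1.0) -> float:
    """Standard k-prototypes dissimilarity between a record and a prototype.

    gamma (an illustrative default here) weights the categorical part
    against the numeric part.
    """
    x_num, x_cat = x
    p_num, p_cat = prototype
    # Numeric part: squared Euclidean distance.
    numeric = sum((a - b) ** 2 for a, b in zip(x_num, p_num))
    # Categorical part: number of attributes whose values differ.
    mismatches = sum(1 for a, b in zip(x_cat, p_cat) if a != b)
    return numeric + gamma * mismatches


def assign(records: Sequence[Record], prototypes: Sequence[Record],
           gamma: float = 1.0) -> List[int]:
    """Return, for each record, the index of its nearest prototype."""
    return [
        min(range(len(prototypes)),
            key=lambda j: mixed_distance(r, prototypes[j], gamma))
        for r in records
    ]


if __name__ == "__main__":
    records = [((1.0, 2.0), ("a",)), ((9.0, 8.0), ("b",))]
    prototypes = [((1.0, 2.5), ("a",)), ((9.0, 9.0), ("b",))]
    print(assign(records, prototypes))
```

In a distributed setting, an assignment step of this kind is the part that would typically be applied to each data partition in parallel, with prototypes updated from the aggregated results; how the proposed method organizes this on Spark is not stated in the abstract.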

Author information

Authors and Affiliations

LARODEC, Université de Tunis, Institut Supérieur de Gestion de Tunis, 41 Avenue de la liberté, cité Bouchoucha, 2000, Le Bardo, Tunisia
Mohamed Aymen Ben HajKacem, Chiheb-Eddine Ben N’cir & Nadia Essoussi

Corresponding author

Correspondence to Mohamed Aymen Ben HajKacem.

Editor information

Editors and Affiliations

University of Houston, Houston, Texas, USA
Carlos Ordonez
LIAS/ISAE-ENSMA, Chasseneuil-du-Poitou, France
Ladjel Bellatreche

About this paper

Cite this paper

Ben HajKacem, M.A., Ben N’cir, C.E., Essoussi, N. (2018). Scalable Random Sampling K-Prototypes Using Spark. In: Ordonez, C., Bellatreche, L. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2018. Lecture Notes in Computer Science, vol 11031. Springer, Cham. https://doi.org/10.1007/978-3-319-98539-8_24

DOI: https://doi.org/10.1007/978-3-319-98539-8_24
Published: 08 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-98538-1
Online ISBN: 978-3-319-98539-8
eBook Packages: Computer Science, Computer Science (R0)
