A survey on parallel clustering algorithms for Big Data

Dafir, Zineb; Lamari, Yasmine; Slaoui, Said Chah

doi:10.1007/s10462-020-09918-2

Published: 06 October 2020

Volume 54, pages 2411–2443, (2021)

Artificial Intelligence Review


Abstract

Data clustering is one of the most studied data mining tasks. It aims, through various methods, to discover previously unknown groups within data sets. In recent years, considerable progress has been made in this field, leading to the development of innovative and promising clustering algorithms. However, traditional clustering algorithms present serious issues with respect to speed-up, throughput, and scalability. Thus, they can no longer be used directly in the context of Big Data, where data are mainly characterized by their volume, velocity, and variety. To overcome these limitations, current research is turning to parallel computing, giving rise to so-called parallel clustering algorithms. This paper presents an overview of the latest parallel clustering algorithms, categorized according to the computing platforms used to handle Big Data, namely horizontal and vertical scaling platforms. The former category includes peer-to-peer networks, MapReduce, and Spark platforms, while the latter includes multi-core processors, Graphics Processing Units, and Field Programmable Gate Arrays. In addition, it compares the performance of the reviewed algorithms based on common criteria of clustering validation in the Big Data context. It thereby provides the reader with an overall view of current parallel clustering techniques.
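To make the idea of a parallel clustering algorithm concrete, the following is a minimal sketch of one k-means iteration written in a map/reduce style. It is not taken from the surveyed paper; the function names (`map_assign`, `reduce_update`, `kmeans_iteration`) and the in-memory list of partitions are assumptions made for illustration only. Each partition is processed independently, which is the part a horizontal scaling platform would distribute across workers.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def map_assign(chunk, centroids):
    """Map step (hypothetical name): for one data partition, assign each point
    to its nearest centroid and return per-cluster partial sums and counts."""
    dim = len(centroids[0])
    sums = {}
    for point in chunk:
        nearest = min(range(len(centroids)), key=lambda j: dist(point, centroids[j]))
        acc, count = sums.get(nearest, ([0.0] * dim, 0))
        sums[nearest] = ([a + x for a, x in zip(acc, point)], count + 1)
    return sums


def reduce_update(partials, centroids):
    """Reduce step (hypothetical name): merge the partial sums from all
    partitions and compute the new centroid of every cluster."""
    dim = len(centroids[0])
    totals = {}
    for partial in partials:
        for j, (vec, count) in partial.items():
            acc, n = totals.get(j, ([0.0] * dim, 0))
            totals[j] = ([a + x for a, x in zip(acc, vec)], n + count)
    new_centroids = []
    for j, old in enumerate(centroids):
        if j in totals:
            vec, n = totals[j]
            new_centroids.append(tuple(v / n for v in vec))
        else:
            # An empty cluster keeps its previous centroid in this sketch.
            new_centroids.append(tuple(old))
    return new_centroids


def kmeans_iteration(partitions, centroids):
    """One iteration: the map calls are independent, so a real platform could
    run them in parallel; here they run sequentially for simplicity."""
    partials = [map_assign(chunk, centroids) for chunk in partitions]
    return reduce_update(partials, centroids)


if __name__ == "__main__":
    partitions = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
    print(kmeans_iteration(partitions, [(1.0, 1.0), (9.0, 9.0)]))
```

Only the small per-cluster sums and counts travel from the map step to the reduce step, not the points themselves, which is why this pattern tends to scale with data volume.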


Author information

Authors and Affiliations

Faculty of Science of Rabat, Mohammed V University, Rabat, Morocco
Zineb Dafir, Yasmine Lamari & Said Chah Slaoui

Authors

Zineb Dafir
Yasmine Lamari
Said Chah Slaoui

Corresponding author

Correspondence to Zineb Dafir.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Dafir, Z., Lamari, Y. & Slaoui, S.C. A survey on parallel clustering algorithms for Big Data. Artif Intell Rev 54, 2411–2443 (2021). https://doi.org/10.1007/s10462-020-09918-2

Published: 06 October 2020
Issue Date: April 2021
DOI: https://doi.org/10.1007/s10462-020-09918-2
