
A survey on parallel clustering algorithms for Big Data

Artificial Intelligence Review

Abstract

Data clustering is one of the most studied data mining tasks. It aims, through various methods, to discover previously unknown groups within data sets. In recent years, considerable progress in this field has led to the development of innovative and promising clustering algorithms. However, traditional clustering algorithms suffer from serious limitations in speed-up, throughput, and scalability. They can therefore no longer be applied directly in the context of Big Data, where data are mainly characterized by their volume, velocity, and variety. To overcome these limitations, current research is turning to parallel computing, giving rise to so-called parallel clustering algorithms. This paper presents an overview of the latest parallel clustering algorithms, categorized according to the computing platforms used to handle Big Data, namely horizontal and vertical scaling platforms. The former category includes peer-to-peer networks, MapReduce, and Spark platforms, while the latter includes multi-core processors, Graphics Processing Units, and Field Programmable Gate Arrays. In addition, the paper compares the performance of the reviewed algorithms based on common criteria of clustering validation in the Big Data context. It thus provides the reader with an overall view of current parallel clustering techniques.



Author information

Corresponding author

Correspondence to Zineb Dafir.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Dafir, Z., Lamari, Y. & Slaoui, S.C. A survey on parallel clustering algorithms for Big Data. Artif Intell Rev 54, 2411–2443 (2021). https://doi.org/10.1007/s10462-020-09918-2

  • DOI: https://doi.org/10.1007/s10462-020-09918-2
