Skip to main content
Log in

Anytime parallel density-based clustering

  • Published:
Data Mining and Knowledge Discovery Aims and scope Submit manuscript

Abstract

The density-based clustering algorithm DBSCAN is a state-of-the-art data clustering technique with numerous applications in many fields. However, DBSCAN requires neighborhood queries for all objects and propagation of labels from object to object. This scheme is time consuming and thus limits its applicability for large datasets. In this paper, we propose a novel anytime approach to cope with this problem by reducing both the range query and the label propagation time of DBSCAN. Our algorithm, called AnyDBC, compresses the data into smaller density-connected subsets called primitive clusters and labels objects based on connected components of these primitive clusters to reduce the label propagation time. Moreover, instead of passively performing range queries for all objects as in existing techniques, AnyDBC iteratively and actively learns the current cluster structure of the data and selects a few most promising objects for refining clusters at each iteration. Thus, in the end, it performs substantially fewer range queries compared to DBSCAN while still satisfying the cluster definition of DBSCAN. Moreover, by processing queries in block and merging the results into the current cluster structure, AnyDBC can be efficiently parallelized on shared memory architectures to further accelerate the performance, uniquely making it a parallel and anytime technique at the same time. Experiments show speedup factors of orders of magnitude compared to DBSCAN and its fastest variants as well as a high parallel scalability on multicore processors for very large real and synthetic complex datasets.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15
Fig. 16
Fig. 17
Fig. 18
Fig. 19
Fig. 20
Fig. 21
Fig. 22
Fig. 23
Fig. 24
Fig. 25
Fig. 26
Fig. 27
Fig. 28
Fig. 29
Fig. 30
Fig. 31
Fig. 32
Fig. 33
Fig. 34
Fig. 35
Fig. 36
Fig. 37
Fig. 38

Similar content being viewed by others

Notes

  1. http://openmp.org/wp/.

  2. http://openmp.org/wp/.

References

  • Aggarwal CC, Reddy CK (eds) (2014) Data clustering: algorithms and applications. CRC Press, Boca Raton

    MATH  Google Scholar 

  • Ankerst M, Breunig MM, Kriegel HP, Sander J (1999) OPTICS: ordering points to identify the clustering structure. In: International conference on management of data (SIGMOD), pp 49–60

  • Arlia D, Coppola M (2001) Experiments in parallel clustering with DBSCAN. In: International Euro-par conference, pp 326–331

  • Assent I, Kranen P, Baldauf C, Seidl T (2012) AnyOut: anytime outlier detection on streaming data. In: International conference on database systems for advanced applications (DASFAA) (1), pp 228–242

  • Bentley JL (1975) Multidimensional binary search trees used for associative searching. Commun ACM 18(9):509–517

    Article  MathSciNet  MATH  Google Scholar 

  • Böhm C, Noll R, Plant C, Wackersreuther B (2009) Density-based clustering using graphics processors. In: International conference on information and knowledge management (CIKM), pp 661–670

  • Böhm C, Feng J, He X, Mai ST, Plant C, Shao J (2011) A novel similarity measure for fiber clustering using longest common subsequence. In: Proceedings of the 2011 workshop on data mining for medicine and healthcare, pp 1–9

  • Borah B, Bhattacharyya DK (2004) An improved sampling-based DBSCAN for large spatial databases. In: International conference on intelligent sensing and information processing (ICISIP), pp 92–96

  • Brecheisen S, Kriegel H, Pfeifle M (2004) Efficient density-based clustering of complex objects. In: IEEE international conference on data mining (ICDM), pp 43–50

  • Brecheisen S, Kriegel H, Pfeifle M (2006a) Parallel density-based clustering of complex objects. In: Pacific-Asia conference on knowledge discovery and data mining (PAKDD), pp 179–188

  • Brecheisen S, Kriegel HP, Pfeifle M (2006b) Multi-step density-based clustering. Knowl. Inf Syst 9(3):284–308

    Google Scholar 

  • Chen L, Ng RT (2004) On the marriage of Lp-norms and edit distance. In: Very large data bases (VLDB), pp 792–803

  • Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to algorithms, 3rd edn. MIT Press, Cambridge

    MATH  Google Scholar 

  • Dai BR, Lin IC (2012) Efficient map/reduce-based DBSCAN algorithm with optimized data partition. In: IEEE international conference on cloud computing (CLOUD), pp 59–66

  • Dash M, Liu H, Xu X (2001) ‘\(1 + 1 > 2\)’: Merging distance and density based clustering. In: International conference on database systems for advanced applications (DASFAA), pp 32–39

  • Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 226–231

  • Fonollosa J, Sheik S, Huerta R, Marco S (2015) Reservoir computing compensates slow response of chemosensor arrays exposed to fast varying gas concentrations in continuous monitoring. Sensors Actuators B: Chem 215:618–629

    Article  Google Scholar 

  • Francis Z, Villagrasa C, Clairand I (2011) Simulation of DNA damage clustering after proton irradiation using an adapted DBSCAN algorithm. Comput Methods Programs Biomed 101(3):265–270

    Article  Google Scholar 

  • Gan J, Tao Y (2015) DBSCAN revisited: mis-claim, un-fixability, and approximation. In: International conference on management of data (SIGMOD), pp 519–530

  • Gan J, Tao Y (2017) On the hardness and approximation of Euclidean DBSCAN. ACM Trans Database Syst 42(3):14:1–14:45

    Article  MathSciNet  Google Scholar 

  • Götz M, Bodenstein C, Riedel M (2015) HPDBSCAN: highly parallel DBSCAN. In: Proceedings of the workshop on machine learning in high-performance computing environments, pp 2:1–2:10

  • Greiner J (1994) A comparison of parallel algorithms for connected components. In: Proceedings of the 6th annual ACM symposium on parallel algorithms and architectures (SSPA), pp 16–25

  • Gunawan A (2013) A faster algorithm for DBSCAN. Msc thesis, TU Eindhoven

  • He Y, Tan H, Luo W, Mao H, Ma D, Feng S, Fan J (2011) MR-DBSCAN: an efficient parallel density-based clustering algorithm using MapReduce. In: International conference on parallel and distributed systems (ICPADS), pp 473–480

  • Januzaj E, Kriegel HP, Pfeifle M (2004) Scalable density-based distributed clustering. In: European conference on principles of data mining and knowledge discovery (PKDD), pp 231–244

  • Kobayashi T, Iwamura M, Matsuda T, Kise K (2013) An anytime algorithm for camera-based character recognition. In: International conference on document analysis and recognition (ICDAR), pp 1140–1144

  • Kriegel HP, Pfeifle M (2005) Density-based clustering of uncertain data. In: ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 672–677

  • Kriegel H, Schubert E, Zimek A (2017) The (black) art of runtime evaluation: are we comparing algorithms or implementations? Knowl Inf Syst 52(2):341–378

    Article  Google Scholar 

  • Kumar V (2002) Introduction to parallel computing, 2nd edn. Addison-Wesley, Boston

    Google Scholar 

  • Li T, Heinis T, Luk W (2016) Hashing-based approximate DBSCAN. In: Symposium on advances in databases and information systems (ADBIS), pp 31–45

  • Li T, Heinis T, Luk W (2017) ADvaNCE—efficient and scalable approximate density-based clustering based on hashing. Informatica 28(1):105–130

    Article  Google Scholar 

  • Lulli A, Dell’Amico M, Michiardi P, Ricci L (2016) NG-DBSCAN: scalable density-based clustering for arbitrary data. Proc VLDB Endow (PVLDB) 10(3):157–168

    Article  Google Scholar 

  • Mahran S, Mahar K (2008) Using grid for accelerating density-based clustering. In: IEEE international conference on computer and information technology (CIT), pp 35–40

  • Mai ST, Goebl S, Plant C (2012) A similarity model and segmentation algorithm for white matter fiber tracts. In: IEEE international conference on data mining (ICDM), pp 1014–1019

  • Mai ST, He X, Feng J, Böhm C (2013a) Efficient anytime density-based clustering. In: SIAM international conference on data mining (SDM), pp 112–120

  • Mai ST, He X, Hubig N, Plant C, Böhm C (2013b) Active density-based clustering. In: IEEE international conference on data mining (ICDM), pp 508–517

  • Mai ST, He X, Feng J, Plant C, Böhm C (2015) Anytime density-based clustering of complex data. Knowl Inf Syst 45(2):319–355

    Article  Google Scholar 

  • Mai ST, Assent I, Le A (2016a) Anytime OPTICS: an efficient approach for hierarchical density-based clustering. In: International conference on database systems for advanced applications (DASFAA), pp 164–179

  • Mai ST, Assent I, Storgaard M (2016b) AnyDBC: an efficient anytime density-based clustering algorithm for very large complex datasets. In: ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 1025–1034

  • Mai ST, Dieu MS, Assent I, Jacobsen J, Kristensen J, Birk M (2017) Scalable and interactive graph clustering algorithm on multicore CPUs. In: IEEE international conference on data engineering (ICDE), pp 349–360

  • Patwary MMA, Palsetia D, Agrawal A, Liao W, Manne F, Choudhary AN (2012) A new scalable parallel DBSCAN algorithm using the disjoint-set data structure. In: Proceedings of the international conference on high performance computing, networking, storage and analysis (SC), p 62

  • Patwary MMA, Satish N, Sundaram N, Manne F, Habib S, Dubey P (2014) Pardicle: parallel approximate density-based clustering. In: Proceedings of the international conference on high performance computing, networking, storage and analysis (SC), pp 560–571

  • Reiss A, Stricker D (2012) Introducing a new benchmarked dataset for activity monitoring. In: International symposium on wearable computers (ISWC), pp 108–109

  • Sakai T, Tamura K, Kitakami H (2017) Cell-based DBSCAN algorithm using minimum bounding rectangle criteria. In: International conference on database systems for advanced applications (DASFAA), pp 133–144

  • Sander J, Ester M, Kriegel HP, Xu X (1998) Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications. Data Min Knowl Discov 2(2):169–194

    Article  Google Scholar 

  • Schubert E, Sander J, Ester M, Kriegel H, Xu X (2017) DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Trans Database Syst 42(3):19:1–19:21

    Article  MathSciNet  Google Scholar 

  • Tramacere A, Vecchio C (2013) \(\gamma \)-Ray DBSCAN: a clustering algorithm applied to fermi-LAT \(\gamma \)-ray data—I. Detection performances with real and simulated data. Astron Astrophys 549:A138

    Article  Google Scholar 

  • Vinh NX, Epps J, Bailey J (2009) Information theoretic measures for clusterings comparison: is a correction for chance necessary? In: International conference on machine learning (ICML), pp 1073–1080

  • Wang X, Hamilton HJ (2003) DBRS: a density-based spatial clustering method with random sampling. In: Pacific-Asia conference on knowledge discovery and data mining (PAKDD), pp 563–575

  • Xu X, Jäger J, Kriegel HP (1999) A fast parallel clustering algorithm for large spatial databases. Data Min Knowl Discov 3(3):263–290

    Article  Google Scholar 

  • Zaki MJ, M W Jr (2014) Data mining and analysis: fundamental concepts and algorithms. Cambridge University Press, New York

    Book  Google Scholar 

  • Zhao W, Hopke PK, Prather KA (2008) Comparison of two cluster analysis methods using single particle mass spectra. Atmos Environ 42(5):881–892

    Article  Google Scholar 

  • Zhou S, Zhou A, Cao J, Jin W, Fan Y, Hu Y (2000) Combining sampling technique with DBSCAN algorithm for clustering large spatial databases. In: Pacific-Asia conference on knowledge discovery and data mining (PAKDD), pp 169–172

  • Zilberstein S (1996) Using anytime algorithms in intelligent systems. AI Mag 17(3):73–83

    Google Scholar 

Download references

Acknowledgements

We thank Prof. Yufei Tao for providing us with the binary file of DBSCANR and the authors of PDSDBSCAN for making their source code available for download. Our special thanks to the anonymous reviewers for their invaluable comments. We appreciate the support and discussion with Sihem Amer-Yahia, Diep Phan, Jon Jacobsen, Jesper Kristensen, Ky Nguyen, Kenneth Bøgh, Sean Chester, and Manuel Ciosici during the preparation of this paper. Part of this research was funded by a Villum postdoc fellowship.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Son T. Mai.

Additional information

Responsible editor: Jian Pei.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Mai, S.T., Assent, I., Jacobsen, J. et al. Anytime parallel density-based clustering. Data Min Knowl Disc 32, 1121–1176 (2018). https://doi.org/10.1007/s10618-018-0562-1

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10618-018-0562-1

Keywords

Navigation