Anytime parallel density-based clustering

Mai, Son T.; Assent, Ira; Jacobsen, Jon; Dieu, Martin Storgaard

doi:10.1007/s10618-018-0562-1

Published: 10 April 2018

Data Mining and Knowledge Discovery, Volume 32, pages 1121–1176 (2018)

Abstract

The density-based clustering algorithm DBSCAN is a state-of-the-art data clustering technique with numerous applications in many fields. However, DBSCAN requires neighborhood queries for all objects and propagation of labels from object to object. This scheme is time-consuming and thus limits its applicability to large datasets. In this paper, we propose a novel anytime approach to cope with this problem by reducing both the range query time and the label propagation time of DBSCAN. Our algorithm, called AnyDBC, compresses the data into smaller density-connected subsets called primitive clusters and labels objects based on the connected components of these primitive clusters to reduce the label propagation time. Moreover, instead of passively performing range queries for all objects as in existing techniques, AnyDBC iteratively and actively learns the current cluster structure of the data and selects the few most promising objects for refining clusters at each iteration. Thus, in the end, it performs substantially fewer range queries than DBSCAN while still satisfying the cluster definition of DBSCAN. Furthermore, by processing queries in blocks and merging the results into the current cluster structure, AnyDBC can be efficiently parallelized on shared memory architectures to further accelerate performance, uniquely making it both a parallel and an anytime technique. Experiments show speedup factors of orders of magnitude compared to DBSCAN and its fastest variants, as well as high parallel scalability on multicore processors for very large real and synthetic complex datasets.
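
To make the cost named in the abstract concrete, the following is a minimal sketch of plain DBSCAN, not of AnyDBC. It issues one range query per object and propagates labels from object to object, which is the behavior AnyDBC is said to reduce. The function names, the brute-force `range_query`, the Euclidean distance, and the query counter are illustrative assumptions and do not come from the paper.

```python
from collections import deque
from math import dist

NOISE = -1


def range_query(points, i, eps):
    """Brute-force eps-neighborhood of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Plain DBSCAN baseline (illustrative sketch, not AnyDBC).

    Returns (labels, queries): the cluster label of each point
    (NOISE for noise) and the number of range queries performed.
    """
    labels = [None] * len(points)
    queries = 0
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = range_query(points, i, eps)
        queries += 1
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(neighbors)
        while queue:  # label propagation from object to object
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: labeled, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = range_query(points, j, eps)
            queries += 1
            if len(nbrs) >= min_pts:  # only core points expand the cluster
                queue.extend(nbrs)
    return labels, queries


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels, queries = dbscan(points, eps=1.5, min_pts=3)
print(labels, queries)
```

In this sketch every object is queried exactly once, so the number of range queries always equals the number of objects. That per-object cost is what the abstract describes AnyDBC as avoiding by querying only selected objects.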

Acknowledgements

We thank Prof. Yufei Tao for providing us with the binary file of DBSCANR and the authors of PDSDBSCAN for making their source code available for download. Our special thanks go to the anonymous reviewers for their invaluable comments. We appreciate the support of, and discussions with, Sihem Amer-Yahia, Diep Phan, Jon Jacobsen, Jesper Kristensen, Ky Nguyen, Kenneth Bøgh, Sean Chester, and Manuel Ciosici during the preparation of this paper. Part of this research was funded by a Villum postdoc fellowship.

Author information

Son T. Mai
Present address: University of Grenoble Alpes, Grenoble, France

Authors and Affiliations

Aarhus University, Aarhus, Denmark
Son T. Mai, Ira Assent, Jon Jacobsen & Martin Storgaard Dieu

Corresponding author

Correspondence to Son T. Mai.

Additional information

Responsible editor: Jian Pei.

About this article

Cite this article

Mai, S.T., Assent, I., Jacobsen, J. et al. Anytime parallel density-based clustering. Data Min Knowl Disc 32, 1121–1176 (2018). https://doi.org/10.1007/s10618-018-0562-1

Received: 30 November 2016
Accepted: 30 March 2018
Published: 10 April 2018
Issue Date: July 2018
DOI: https://doi.org/10.1007/s10618-018-0562-1
