On the Hardness and Approximation of Euclidean DBSCAN

Author:

Yufei Tao

ACM Transactions on Database Systems (TODS), Volume 42, Issue 3

Article No.: 14, Pages 1 - 45

https://doi.org/10.1145/3083897

Published: 31 July 2017

Abstract

DBSCAN is a method proposed in 1996 for clustering multi-dimensional points, and it has been applied extensively. Its computational hardness remains unresolved to this date. The original KDD'96 paper claimed an algorithm with O(n log n) "average runtime complexity" (where n is the number of data points) without a rigorous proof. In 2013, a genuine O(n log n)-time algorithm was found in 2D space under Euclidean distance. The hardness for dimensionality d ≥ 3 has remained open ever since.

This article considers the problem of computing DBSCAN clusters from scratch (assuming no existing indexes) under Euclidean distance. We prove that, for d ≥ 3, the problem requires Ω(n^{4/3}) time to solve, unless very significant breakthroughs (ones widely believed to be impossible) could be made in theoretical computer science. Motivated by this, we propose a relaxed version of the problem called ρ-approximate DBSCAN, which returns the same clusters as DBSCAN, unless the clusters are "unstable" (i.e., they change once the input parameters are slightly perturbed). The ρ-approximate problem can be settled in O(n) expected time regardless of the constant dimensionality d.
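To make the notion of an "unstable" clustering concrete, here is a minimal sketch in Python. It is not the article's O(n) algorithm: `dbscan` is a plain brute-force O(n^2) implementation, and `is_stable` is a hypothetical helper that assumes "slightly perturbed" means enlarging the radius from eps to eps(1 + ρ) while keeping the density threshold fixed. All function and parameter names are illustrative.

```python
from collections import deque
from math import dist  # Python 3.8+


def dbscan(points, eps, min_pts):
    """Brute-force DBSCAN; returns a set of clusters (frozensets of point indices)."""
    n = len(points)
    # eps-neighbourhood of every point; a point counts as its own neighbour.
    nbrs = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
            for i in range(n)]
    core = [len(nbrs[i]) >= min_pts for i in range(n)]
    seen = [False] * n  # marks core points already assigned to a cluster
    clusters = set()
    for start in range(n):
        if not core[start] or seen[start]:
            continue
        seen[start] = True
        cluster = {start}
        queue = deque([start])  # holds core points only
        while queue:
            i = queue.popleft()
            for j in nbrs[i]:
                cluster.add(j)  # border points join but are never expanded
                if core[j] and not seen[j]:
                    seen[j] = True
                    queue.append(j)
        clusters.add(frozenset(cluster))
    return clusters


def is_stable(points, eps, min_pts, rho):
    """Assumed stability test: same clusters at radius eps and eps * (1 + rho)."""
    return dbscan(points, eps, min_pts) == dbscan(points, eps * (1 + rho), min_pts)


# Two groups separated by a gap of 1.25: distinct at eps = 1.0, merged at eps = 1.5.
near = [(0, 0), (1, 0), (2, 0), (3.25, 0), (4.25, 0), (5.25, 0)]
# Two groups separated by a gap of 8: unaffected by the same perturbation.
far = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]

print(is_stable(near, eps=1.0, min_pts=2, rho=0.5))
print(is_stable(far, eps=1.0, min_pts=2, rho=0.5))
```

On the `near` input the clustering depends on whether the radius is 1.0 or 1.5, which is the kind of input on which an approximate answer may differ from the exact one. On the `far` input both radii give the same clusters.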

The article also enhances the previous result on the exact DBSCAN problem in 2D space. We show that, if the n data points have been pre-sorted on each dimension (i.e., one sorted list per dimension), the problem can be settled in O(n) worst-case time. As a corollary, when all the coordinates are integers, the 2D DBSCAN problem can be solved in O(n log log n) time deterministically, improving the existing O(n log n) bound.
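One way to read the corollary is as a sort-then-cluster cost breakdown. The sketch below assumes that the sorted lists are produced by a deterministic integer-sorting routine running in O(n log log n) time. The abstract does not name the routine, so $T_{\mathrm{sort}}$ is a placeholder symbol introduced here.

```latex
\[
T_{\mathrm{2D}}(n)
  \;=\; \underbrace{2\,T_{\mathrm{sort}}(n)}_{\text{one sorted list per dimension}}
  \;+\; \underbrace{O(n)}_{\text{DBSCAN on pre-sorted input}}
  \;=\; O(n \log\log n) + O(n)
  \;=\; O(n \log\log n).
\]
```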

Published In

ACM Transactions on Database Systems Volume 42, Issue 3

Invited Paper from SIGMOD 2015, Invited Paper from PODS 2015, Regular Papers and Technical Correspondence

September 2017

220 pages

ISSN:0362-5915

EISSN:1557-4644

DOI:10.1145/3129336

Editor:
Christian S. Jensen
Aalborg University, Denmark

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 31 July 2017

Accepted: 01 April 2017

Revised: 01 April 2017

Received: 01 April 2016

Published in TODS Volume 42, Issue 3

Qualifiers

Research-article
Research
Refereed
