research-article

On the Hardness and Approximation of Euclidean DBSCAN

Published: 31 July 2017

Abstract

DBSCAN is a method proposed in 1996 for clustering multi-dimensional points, and has been applied extensively. Its computational hardness remains unresolved to this date. The original KDD'96 paper claimed an algorithm of O(n log n) "average runtime complexity" (where n is the number of data points) without a rigorous proof. In 2013, a genuine O(n log n)-time algorithm was found in 2D space under Euclidean distance. The hardness for dimensionality d ≥ 3 has remained open ever since.
This article considers the problem of computing DBSCAN clusters from scratch (assuming no existing indexes) under Euclidean distance. We prove that, for d ≥ 3, the problem requires Ω(n^{4/3}) time to solve, unless very significant breakthroughs (ones widely believed to be impossible) could be made in theoretical computer science. Motivated by this, we propose a relaxed version of the problem called ρ-approximate DBSCAN, which returns the same clusters as DBSCAN, unless the clusters are "unstable" (i.e., they change once the input parameters are slightly perturbed). The ρ-approximate problem can be settled in O(n) expected time regardless of the constant dimensionality d.
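The notion of "unstable" clusters can be made concrete with a small experiment. The following is a minimal sketch, not the algorithm of the article: it runs a brute-force O(n^2) exact DBSCAN twice and checks whether the clusters change. It assumes that "slightly perturbed" means enlarging the radius from eps to eps * (1 + rho); the function names dbscan and is_stable, and the parameter names eps, min_pts and rho, are hypothetical and chosen only for illustration.

```python
from math import dist


def dbscan(points, eps, min_pts):
    """Brute-force exact DBSCAN; returns a set of frozensets of point indices."""
    n = len(points)
    # eps-neighborhood of every point (a point counts as its own neighbor).
    nbrs = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
            for i in range(n)]
    core = [len(nb) >= min_pts for nb in nbrs]

    # Union-find that merges core points lying within eps of each other.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        if core[i]:
            for j in nbrs[i]:
                if core[j]:
                    parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        if core[i]:
            clusters.setdefault(find(i), set()).add(i)
    # A non-core (border) point joins the cluster of every core point within eps;
    # points with no core neighbor are noise and belong to no cluster.
    for i in range(n):
        if not core[i]:
            for j in nbrs[i]:
                if core[j]:
                    clusters[find(j)].add(i)
    return {frozenset(c) for c in clusters.values()}


def is_stable(points, eps, min_pts, rho):
    """True if enlarging eps by a factor (1 + rho) leaves the clusters unchanged."""
    return dbscan(points, eps, min_pts) == dbscan(points, eps * (1 + rho), min_pts)


if __name__ == "__main__":
    # Two groups of three points separated by a gap of about 1.15.
    pts = [(0, 0), (1, 0), (2, 0), (3.15, 0), (4.15, 0), (5.15, 0)]
    print(dbscan(pts, 1.1, 2))
    print(is_stable(pts, 1.1, 2, 0.01), is_stable(pts, 1.1, 2, 0.1))
```

With eps = 1.1 the two groups stay separate for rho = 0.01 but merge for rho = 0.1, so under this reading a relaxed algorithm would be free to return either clustering only in the second case.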
The article also enhances the previous result on the exact DBSCAN problem in 2D space. We show that, if the n data points have been pre-sorted on each dimension (i.e., one sorted list per dimension), the problem can be settled in O(n) worst-case time. As a corollary, when all the coordinates are integers, the 2D DBSCAN problem can be solved in O(n log log n) time deterministically, improving the existing O(n log n) bound.
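The corollary can be read as a simple accounting of running time. As a hedged sketch (the symbols T_2D and T_sort are introduced here for illustration and do not appear in the abstract), suppose sorting n coordinates along one dimension takes T_sort(n) time. Sorting both dimensions and then running the linear-time step on the pre-sorted input gives:

```latex
\begin{align*}
T_{\mathrm{2D}}(n) &= 2\,T_{\mathrm{sort}}(n) + O(n) \\
T_{\mathrm{sort}}(n) = O(n \log n)
  &\;\Longrightarrow\; T_{\mathrm{2D}}(n) = O(n \log n)
  && \text{(comparison-based sorting)} \\
T_{\mathrm{sort}}(n) = O(n \log\log n)
  &\;\Longrightarrow\; T_{\mathrm{2D}}(n) = O(n \log\log n)
  && \text{(assumed deterministic integer sorting)}
\end{align*}
```

The second case matches the stated bound only under the assumption that integer coordinates can be sorted deterministically in O(n log log n) time; the abstract itself does not name the sorting routine.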

Published In

ACM Transactions on Database Systems, Volume 42, Issue 3
Invited Paper from SIGMOD 2015, Invited Paper from PODS 2015, Regular Papers and Technical Correspondence
September 2017
220 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/3129336
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 31 July 2017
Accepted: 01 April 2017
Revised: 01 April 2017
Received: 01 April 2016
Published in TODS Volume 42, Issue 3

Author Tags

  1. DBSCAN
  2. algorithms
  3. computational geometry
  4. density-based clustering
  5. Hopcroft hard

Qualifiers

  • Research-article
  • Research
  • Refereed
