research-article

Fast Density-Based Clustering: Geometric Approach

Authors:
Xiaogang Huang
Southwestern University of Finance and Economics, Chengdu, China
ORCID: 0000-0002-1267-8498

Tiefeng Ma
Southwestern University of Finance and Economics, Chengdu, China
ORCID: 0000-0003-3464-6080

Proceedings of the ACM on Management of Data, Volume 1, Issue 1, Article No. 58, pp. 1–24. https://doi.org/10.1145/3588912

Published: 30 May 2023


Abstract

DBSCAN is a fundamental density-based clustering algorithm with extensive applications. However, a bottleneck of DBSCAN is its O(n^2) worst-case time complexity. In this paper, we propose an algorithm called GAP-DBC, which exploits the geometric relationships between points to address this problem. GAP-DBC first partitions the data set with an efficient partitioning algorithm that uses a limited number of range queries, and then establishes an initial cluster structure based on the partition. It then iteratively refines the cluster structure with additional range queries. Finally, the cluster structure is completed by an iterative algorithm that uses the spatial relationships among points to avoid unnecessary distance calculations. We further show theoretically that GAP-DBC has a strong guarantee in terms of computational efficiency. We conducted experiments on both synthetic and real-world data sets to evaluate the performance of GAP-DBC. The results show that our algorithm is competitive with other state-of-the-art algorithms.
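
To make the quadratic bottleneck mentioned in the abstract concrete, the following is a minimal sketch of textbook DBSCAN with brute-force range queries. It is not GAP-DBC and is not taken from the paper; the names `range_query`, `dbscan`, `eps`, and `min_pts` are illustrative assumptions. Each point triggers one range query, and each brute-force query scans all n points, which is where the O(n^2) worst case comes from.

```python
from collections import deque
from math import dist

NOISE = -1  # label for points that belong to no cluster


def range_query(points, i, eps):
    """Brute-force range query: indices of all points within eps of points[i].

    This scans every point, so a single call costs O(n) distance computations.
    """
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Textbook DBSCAN (illustrative baseline, NOT the GAP-DBC algorithm).

    Returns (labels, num_range_queries). A label of None means "not yet visited".
    """
    labels = [None] * len(points)
    num_range_queries = 0
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        neighbors = range_query(points, i, eps)
        num_range_queries += 1
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = range_query(points, j, eps)
            num_range_queries += 1
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels, num_range_queries


# Hypothetical toy data: two small square clusters and one isolated point.
demo_points = [(0, 0), (0, 1), (1, 0), (1, 1),
               (10, 10), (10, 11), (11, 10), (11, 11),
               (50, 50)]
demo_labels, demo_queries = dbscan(demo_points, eps=1.5, min_pts=3)
print(demo_labels)   # [0, 0, 0, 0, 1, 1, 1, 1, -1]
print(demo_queries)  # 9: one O(n) scan per point, hence n * n distances overall
```

In this baseline every point is queried exactly once, so the cost is dominated by n full scans. Reducing the number of range queries, or the distance computations inside them, is the kind of saving the abstract attributes to GAP-DBC.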

Supplemental Material

PACMMOD-V1mod58.mp4

Presentation video for SIGMOD 2023 (mp4, 18.6 MB)

Index Terms

1. Theory of computation
  1. Theory and algorithms for application domains
    1. Machine learning theory
      1. Unsupervised learning and clustering

Published in
Proceedings of the ACM on Management of Data Volume 1, Issue 1
PACMMOD
May 2023
2807 pages
EISSN: 2836-6573
DOI: 10.1145/3603164
Editor: Divyakant Agrawal, UC Santa Barbara, United States
Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
DBSCAN
algorithm
density-based clustering
geometric approach
Qualifiers
- research-article