DOI: 10.1145/3132847.3133112
short-paper

A Communication Efficient Parallel DBSCAN Algorithm based on Parameter Server

Published: 06 November 2017

Abstract

Recent benchmark studies show that MPI-based distributed implementations of DBSCAN, e.g., PDSDBSCAN, outperform other implementations such as those built on Apache Spark. However, the communication cost of MPI-based DBSCAN increases drastically with the number of processors, which makes it inefficient for large-scale problems.
In this paper, we propose PS-DBSCAN, a parallel DBSCAN algorithm that combines the disjoint-set data structure with the Parameter Server framework to minimize communication cost. Because data points within the same cluster may be distributed over different workers, a single cluster can be split into several disjoint-sets, and merging them incurs large communication costs. Our algorithm employs a fast global union approach to merge these disjoint-sets and alleviate the communication burden. Experiments on datasets of different scales demonstrate that PS-DBSCAN outperforms PDSDBSCAN, with a 2-10 times speedup in communication efficiency. We have released PS-DBSCAN on Platform of AI (PAI), an algorithm platform in Alibaba Cloud.
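
To make the merging problem concrete, the following is a minimal single-process sketch of how disjoint-sets built independently on different workers can be unioned into global clusters. It is not the paper's fast global union approach, which the abstract does not detail; the names `DisjointSet`, `global_union`, and `clusters`, and the toy neighbor pairs, are assumptions made for illustration only.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


class DisjointSet:
    """Union-find with path compression and union by size."""

    def __init__(self) -> None:
        self.parent: Dict[Hashable, Hashable] = {}
        self.size: Dict[Hashable, int] = {}

    def find(self, x: Hashable) -> Hashable:
        # Register unseen items as singleton sets.
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
            return x
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a: Hashable, b: Hashable) -> bool:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        # Attach the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True


def local_sets(pairs: Iterable[Tuple[int, int]]) -> DisjointSet:
    """Build one worker's disjoint-set from its density-connected point pairs."""
    ds = DisjointSet()
    for a, b in pairs:
        ds.union(a, b)
    return ds


def global_union(workers: List[DisjointSet]) -> DisjointSet:
    """Merge per-worker sets into one structure (hypothetical server side).

    Each worker contributes a (point, local_root) pair for every point it
    holds; points shared by several workers link their local sets together.
    """
    server = DisjointSet()
    for ds in workers:
        for point in list(ds.parent):
            server.union(point, ds.find(point))
    return server


def clusters(ds: DisjointSet) -> List[List[int]]:
    """Return the sets as sorted lists of point ids."""
    groups: Dict[Hashable, set] = {}
    for point in list(ds.parent):
        groups.setdefault(ds.find(point), set()).add(point)
    return sorted(sorted(g) for g in groups.values())


# Toy data (assumed): point 3 is seen by both workers, so their sets must merge.
worker0 = local_sets([(1, 2), (2, 3)])
worker1 = local_sets([(3, 4), (5, 6)])
merged = global_union([worker0, worker1])
print(clusters(merged))
```

In this sketch every point sends one pair to the server, so the traffic grows with the number of points per worker. Reducing that kind of traffic is the concern the abstract raises.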



Published In

CIKM '17: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management
November 2017
2604 pages
ISBN:9781450349185
DOI:10.1145/3132847

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. density-based clustering
  2. parallel dbscan
  3. parameter server

Qualifiers

  • Short-paper

Conference

CIKM '17
Acceptance Rates

CIKM '17 Paper Acceptance Rate 171 of 855 submissions, 20%;
Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)1
  • Downloads (Last 6 weeks)0
Reflects downloads up to 15 Feb 2025

Citations

Cited By

  • (2023) A fast parallelized DBSCAN algorithm based on OpenMp for detection of criminals on streaming services. Frontiers in Big Data 6. DOI: 10.3389/fdata.2023.1292923. Online publication date: 31-Oct-2023
  • (2023) Fast tree-based algorithms for DBSCAN for low-dimensional data on GPUs. Proceedings of the 52nd International Conference on Parallel Processing (503-512). DOI: 10.1145/3605573.3605594. Online publication date: 7-Aug-2023
  • (2022) Incremental Density-Based Clustering on Multicore Processors. IEEE Transactions on Pattern Analysis and Machine Intelligence 44:3 (1338-1356). DOI: 10.1109/TPAMI.2020.3023125. Online publication date: 1-Mar-2022
  • (2021) Fast Parallel Algorithms for Euclidean Minimum Spanning Tree and Hierarchical Spatial Clustering. Proceedings of the 2021 International Conference on Management of Data (1982-1995). DOI: 10.1145/3448016.3457296. Online publication date: 9-Jun-2021
  • (2021) Integral Curve Clustering and Simplification for Flow Visualization: A Comparative Evaluation. IEEE Transactions on Visualization and Computer Graphics 27:3 (1967-1985). DOI: 10.1109/TVCG.2019.2940935. Online publication date: 1-Mar-2021
  • (2020) Distributed and consistent multi-image feature matching via QuickMatch. The International Journal of Robotics Research. DOI: 10.1177/0278364920917465. Online publication date: 5-Jun-2020
  • (2020) Theoretically-Efficient and Practical Parallel DBSCAN. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (2555-2571). DOI: 10.1145/3318464.3380582. Online publication date: 11-Jun-2020
