MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data

He, Yaobin; Tan, Haoyu; Luo, Wuman; Feng, Shengzhong; Fan, Jianping

doi:10.1007/s11704-013-3158-3

Research Article
Published: 19 December 2013

Frontiers of Computer Science, Volume 8, pages 83–99 (2014)

Yaobin He¹,³, Haoyu Tan², Wuman Luo², Shengzhong Feng¹ & Jianping Fan¹

Abstract

DBSCAN (density-based spatial clustering of applications with noise) is an important spatial clustering technique that is widely adopted in numerous applications. As datasets have grown extremely large, parallel processing of complex data analyses such as DBSCAN has become indispensable. However, existing parallel DBSCAN algorithms have three major drawbacks. First, they fail to properly balance the load among parallel tasks, especially when data are heavily skewed. Second, their scalability is limited because not all the critical sub-procedures are parallelized. Third, most of them are not primarily designed for shared-nothing environments, which makes them less portable to emerging parallel processing paradigms. In this paper, we present MR-DBSCAN, a scalable DBSCAN algorithm using MapReduce. In our algorithm, all the critical sub-procedures are fully parallelized, so there is no performance bottleneck caused by sequential processing. Most importantly, we propose a novel data partitioning method based on computation cost estimation. The objective is to achieve desirable load balancing even in the context of heavily skewed data. We evaluate MR-DBSCAN on large real-world datasets with up to 1.2 billion points. The experimental results confirm the efficiency and scalability of MR-DBSCAN.
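As a rough illustration of the cost-based partitioning idea mentioned in the abstract, the sketch below recursively cuts a 2D point set so that the two sides of each cut carry a similar estimated cost. The cost model (the sum of squared per-cell point counts, a crude proxy for neighborhood-query work), the split rule, and all function and parameter names are assumptions made for illustration; the abstract does not specify the paper's actual estimator or partitioning procedure.

```python
from collections import Counter
from typing import List, Tuple

Point = Tuple[float, float]


def estimated_cost(points: List[Point], cell: float) -> int:
    """Hypothetical cost proxy: sum of squared point counts per grid cell."""
    counts = Counter((int(x // cell), int(y // cell)) for x, y in points)
    return sum(c * c for c in counts.values())


def split_balanced(points: List[Point], cell: float) -> Tuple[List[Point], List[Point]]:
    """Cut along the wider axis at the position that best balances estimated cost.

    Brute force over every cut position; fine for a sketch, not for real data.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    axis = 0 if (max(xs) - min(xs)) >= (max(ys) - min(ys)) else 1
    ordered = sorted(points, key=lambda p: p[axis])

    best_gap, best_i = None, 1
    for i in range(1, len(ordered)):
        gap = abs(estimated_cost(ordered[:i], cell) - estimated_cost(ordered[i:], cell))
        if best_gap is None or gap < best_gap:
            best_gap, best_i = gap, i
    return ordered[:best_i], ordered[best_i:]


def partition(points: List[Point], cell: float, max_cost: int) -> List[List[Point]]:
    """Split recursively until every part is cheap enough or holds a single point."""
    if len(points) <= 1 or estimated_cost(points, cell) <= max_cost:
        return [points]
    left, right = split_balanced(points, cell)
    return partition(left, cell, max_cost) + partition(right, cell, max_cost)


if __name__ == "__main__":
    # Skewed toy data: 40 points packed into one cell plus 10 scattered points.
    dense = [(i * 0.01, j * 0.01) for i in range(8) for j in range(5)]
    sparse = [(10.0 * k, 5.0 * k) for k in range(1, 11)]
    for part in partition(dense + sparse, cell=1.0, max_cost=500):
        print(len(part), estimated_cost(part, 1.0))
```

Splitting on estimated cost rather than on point count or area is what lets dense regions end up in smaller partitions than sparse ones, which is the load-balancing goal the abstract describes for skewed data.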

Author information

Authors and Affiliations

Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
Yaobin He, Shengzhong Feng & Jianping Fan
Department of Computer Science, Guangzhou HKUST Fok Ying Tung Research Institute, Hong Kong University of Science and Technology, Hong Kong, 999077, China
Haoyu Tan & Wuman Luo
University of Chinese Academy of Sciences, Beijing, 100049, China
Yaobin He

Corresponding author

Correspondence to Yaobin He.

Additional information

Yaobin He is a PhD candidate at the University of Chinese Academy of Sciences (CAS), China. He also works as an engineer at the Shenzhen Institutes of Advanced Technology, CAS. His research interests include parallel computing, high performance computing, and data mining.

Haoyu Tan is a research associate at Guangzhou HKUST Fok Ying Tung Research Institute, China. He received his PhD degree in computer science and engineering from HKUST in 2013. His research interests include big data processing, large scale data mining, and distributed systems.

Wuman Luo is a research associate at Guangzhou HKUST Fok Ying Tung Research Institute, China. She received her PhD degree in computer science and engineering from HKUST in 2013. Her research interests include big data processing, distributed databases, and spatio-temporal databases.

Shengzhong Feng is a professor at the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China. His research focuses on parallel algorithms, grid computing, and bioinformatics. His current interests are in developing novel methods for digital city modeling and application.

Jianping Fan is the president of the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China. He has taken part in designing and building the Dawning series of supercomputers since the 1990s. He has completed 11 projects under the 863 Program, holds 5 patents, and has published a book and over 60 papers.

About this article

Cite this article

He, Y., Tan, H., Luo, W. et al. MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data. Front. Comput. Sci. 8, 83–99 (2014). https://doi.org/10.1007/s11704-013-3158-3

Received: 16 February 2013
Accepted: 03 June 2013
Published: 19 December 2013
Issue Date: February 2014
DOI: https://doi.org/10.1007/s11704-013-3158-3
