research-article

Scalable parallel OPTICS data clustering using graph algorithmic techniques

Authors: Mostofa Ali Patwary, Diana Palsetia, Alok Choudhary

SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Article No.: 49, Pages 1 - 12

https://doi.org/10.1145/2503210.2503255

Published: 17 November 2013

Abstract

OPTICS is a hierarchical density-based data clustering algorithm that discovers arbitrary-shaped clusters and eliminates noise using adjustable reachability distance thresholds. Parallelizing OPTICS is considered challenging because the algorithm exhibits a strongly sequential data access order. We present a scalable parallel OPTICS algorithm (POPTICS) designed using graph algorithmic concepts. To break the data access sequentiality, POPTICS exploits the similarities between the OPTICS algorithm and Prim's Minimum Spanning Tree algorithm. Additionally, we use the disjoint-set data structure to achieve high parallelism for distributed cluster extraction. Using high-dimensional datasets containing up to a billion floating-point numbers, we show scalable speedups of up to 27.5 for our OpenMP implementation on a 40-core shared-memory machine, and up to 3,008 for our MPI implementation on a 4,096-core distributed-memory machine. We also show that the quality of the results given by POPTICS is comparable to that of the results given by the classical OPTICS algorithm.
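The two graph ideas named in the abstract, Prim's Minimum Spanning Tree algorithm and the disjoint-set data structure, can be illustrated with a small serial sketch. This is not the authors' POPTICS implementation: it assumes plain Euclidean edge weights instead of OPTICS reachability distances, runs on a single thread, and every name in it (`prim_mst`, `DisjointSet`, `extract_clusters`, `threshold`) is hypothetical and chosen only for illustration.

```python
import heapq
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def prim_mst(points):
    """Prim's algorithm on the complete graph over `points`.

    Returns the MST as a list of (weight, u, v) edges. Assumption: edge
    weights are plain Euclidean distances, not reachability distances.
    """
    n = len(points)
    if n == 0:
        return []
    visited = [False] * n
    visited[0] = True
    heap = [(euclidean(points[0], points[j]), 0, j) for j in range(1, n)]
    heapq.heapify(heap)
    edges = []
    while heap and len(edges) < n - 1:
        w, u, v = heapq.heappop(heap)
        if visited[v]:
            continue  # stale entry: v was reached through a cheaper edge
        visited[v] = True
        edges.append((w, u, v))
        for j in range(n):
            if not visited[j]:
                heapq.heappush(heap, (euclidean(points[v], points[j]), v, j))
    return edges


class DisjointSet:
    """Union-find with path halving; the root is the smallest index in a set."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            # Attach the larger root under the smaller one so labels are stable.
            self.parent[max(rx, ry)] = min(rx, ry)


def extract_clusters(n, edges, threshold):
    """Merge endpoints of every MST edge no longer than `threshold`.

    Returns one label per point: the smallest point index in its cluster.
    """
    ds = DisjointSet(n)
    for w, u, v in edges:
        if w <= threshold:
            ds.union(u, v)
    return [ds.find(i) for i in range(n)]
```

For example, with the five points `(0, 0)`, `(0, 1)`, `(1, 0)`, `(10, 10)`, `(10, 11)` and a threshold of `2.0`, `extract_clusters` groups the first three points under label 0 and the last two under label 3. Because the unions can be applied in any order, this extraction step is the kind of work that lends itself to parallel execution, which is the property the abstract attributes to the disjoint-set structure.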



Published In


SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

November 2013

1123 pages

ISBN:9781450323789

DOI:10.1145/2503210

General Chair: William Gropp, University of Illinois at Urbana-Champaign, Urbana, Illinois

Program Chair: Satoshi Matsuoka, Tokyo Institute of Technology, Tokyo, Japan

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SC13

Sponsor:

SIGHPC
SIGARCH
IEEE-CS

SC13: International Conference for High Performance Computing, Networking, Storage and Analysis

November 17 - 21, 2013

Denver, Colorado

Acceptance Rates

SC '13 Paper Acceptance Rate 91 of 449 submissions, 20%

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Bibliometrics

Total Citations: 34
Total Downloads: 716
Downloads (Last 12 months): 27
Downloads (Last 6 weeks): 1

Reflects downloads up to 15 Feb 2025

Cited By

Chen Y, Ruys W, Biros G (2025). KNN-DBSCAN: a DBSCAN in high dimensions. ACM Transactions on Parallel Computing 12:1 (1-27). DOI: 10.1145/3701624. Online publication date: 12-Feb-2025.
https://dl.acm.org/doi/10.1145/3701624
Xing Z, Zhao W (2024). Block-Diagonal Guided DBSCAN Clustering. IEEE Transactions on Knowledge and Data Engineering 36:11 (5709-5722). DOI: 10.1109/TKDE.2024.3401075. Online publication date: Nov-2024.
https://doi.org/10.1109/TKDE.2024.3401075
Challa J, Goyal N, Sharma A, Sreekumar N, Balasubramaniam S, Goyal P (2024). A Survey and Experimental Review on Data Distribution Strategies for Parallel Spatial Clustering Algorithms. Journal of Computer Science and Technology 39:3 (610-636). DOI: 10.1007/s11390-024-2700-0. Online publication date: 1-May-2024.
https://dl.acm.org/doi/10.1007/s11390-024-2700-0
Chen Y, Zhou L, Pei S, Yu Z, Chen Y, Liu X, Du J, Xiong N (2021). KNN-BLOCK DBSCAN: Fast Clustering for Large-Scale Data. IEEE Transactions on Systems, Man, and Cybernetics: Systems 51:6 (3939-3953). DOI: 10.1109/TSMC.2019.2956527. Online publication date: Jun-2021.
https://doi.org/10.1109/TSMC.2019.2956527
Akritidis L, Alamaniotis M, Fevgas A, Bozanis P (2021). A Scalable Short-Text Clustering Algorithm Using Apache Spark. 2021 IEEE 33rd International Conference on Tools with Artificial Intelligence (ICTAI) (927-934). DOI: 10.1109/ICTAI52525.2021.00149. Online publication date: Nov-2021.
https://doi.org/10.1109/ICTAI52525.2021.00149
Rizwan A, Iqbal N, Khan A, Ahmad R, Kim D (2021). Toward Effective Pattern Recognition Based on Enhanced Weighted K-Mean Clustering Algorithm for Groundwater Resource Planning in Point Cloud. IEEE Access 9 (130154-130169). DOI: 10.1109/ACCESS.2021.3111112. Online publication date: 2021.
https://doi.org/10.1109/ACCESS.2021.3111112
He J, Zhou J, Wang H, Cai L (2021). DACA: Distributed adaptive grid decision graph based clustering algorithm. Software: Practice and Experience 52:5 (1199-1215). DOI: 10.1002/spe.3060. Online publication date: 29-Dec-2021.
https://doi.org/10.1002/spe.3060
Byma S, Dhasade A, Altenhoff A, Dessimoz C, Larus J, Sarkar V, Kim H (2020). Parallel and Scalable Precise Clustering. Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques (217-228). DOI: 10.1145/3410463.3414646. Online publication date: 30-Sep-2020.
https://dl.acm.org/doi/10.1145/3410463.3414646
Sousa R, Boukerche A, Loureiro A (2020). Vehicle Trajectory Similarity. ACM Computing Surveys 53:5 (1-32). DOI: 10.1145/3406096. Online publication date: 28-Sep-2020.
https://dl.acm.org/doi/10.1145/3406096
Tschirhart Z, Schulz K (2020). Evaluation of Clustering Techniques for GPS Phenotyping Using Mobile Sensor Data. Practice and Experience in Advanced Research Computing 2020: Catch the Wave (364-371). DOI: 10.1145/3311790.3396665. Online publication date: 26-Jul-2020.
https://dl.acm.org/doi/10.1145/3311790.3396665
