DOI: 10.1145/2503210.2503255

Scalable parallel OPTICS data clustering using graph algorithmic techniques

Published: 17 November 2013

Abstract

OPTICS is a hierarchical density-based data clustering algorithm that discovers arbitrary-shaped clusters and eliminates noise using adjustable reachability distance thresholds. Parallelizing OPTICS is considered challenging because the algorithm exhibits a strongly sequential data access order. We present a scalable parallel OPTICS algorithm (POPTICS) designed using graph algorithmic concepts. To break the data access sequentiality, POPTICS exploits the similarities between the OPTICS algorithm and Prim's Minimum Spanning Tree algorithm. Additionally, we use the disjoint-set data structure to achieve high parallelism for distributed cluster extraction. Using high-dimensional datasets containing up to a billion floating point numbers, we show scalable speedups of up to 27.5 for our OpenMP implementation on a 40-core shared-memory machine, and up to 3,008 for our MPI implementation on a 4,096-core distributed-memory machine. We also show that the quality of the results given by POPTICS is comparable to that of the classical OPTICS algorithm.
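
The two graph building blocks named in the abstract, Prim's Minimum Spanning Tree algorithm and the disjoint-set (union-find) data structure, can be illustrated with a small sequential sketch. This is not the POPTICS algorithm itself: it assumes plain Euclidean distance in place of OPTICS reachability distance, ignores the minimum-points density parameter, and runs on a single thread. The names `prim_mst`, `DisjointSet`, `extract_clusters`, and the threshold `eps` are hypothetical and chosen only for illustration.

```python
import heapq
import math


def prim_mst(points):
    """Prim's algorithm on the complete Euclidean graph over `points`.

    Returns a list of MST edges as (weight, u, v) tuples.
    Assumption: Euclidean distance stands in for reachability distance.
    """
    n = len(points)
    if n == 0:
        return []
    visited = [False] * n
    edges = []
    heap = [(0.0, 0, 0)]  # (weight, from_vertex, to_vertex); start at vertex 0
    while heap and len(edges) < n - 1:
        w, u, v = heapq.heappop(heap)
        if visited[v]:
            continue  # stale entry: v was already reached by a cheaper edge
        visited[v] = True
        if u != v:
            edges.append((w, u, v))
        for t in range(n):
            if not visited[t]:
                heapq.heappush(heap, (math.dist(points[v], points[t]), v, t))
    return edges


class DisjointSet:
    """Minimal union-find with path halving and link-by-index."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # The smaller index becomes the root of the merged set.
            self.parent[max(ra, rb)] = min(ra, rb)


def extract_clusters(n, mst_edges, eps):
    """Merge the endpoints of every MST edge of weight <= eps.

    Returns one root label per point; equal labels mean the same cluster.
    """
    ds = DisjointSet(n)
    for w, u, v in mst_edges:
        if w <= eps:
            ds.union(u, v)
    return [ds.find(i) for i in range(n)]


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
    mst = prim_mst(pts)
    print(extract_clusters(len(pts), mst, eps=2.0))
```

Because each union touches only the two endpoints of an edge, the edges can be processed in any order, which is the property that makes a disjoint-set formulation attractive for parallel cluster extraction. Varying `eps` over the same tree yields clusterings at different density levels without recomputing the tree.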


Published In

SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
November 2013
1123 pages
ISBN: 9781450323789
DOI: 10.1145/2503210
  • General Chair: William Gropp
  • Program Chair: Satoshi Matsuoka

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. density-based clustering
  2. disjoint-set data structure
  3. minimum spanning tree
  4. union-find algorithm

Qualifiers

  • Research-article

Conference

SC13

Acceptance Rates

SC '13 Paper Acceptance Rate 91 of 449 submissions, 20%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
