ABSTRACT
OPTICS is a hierarchical density-based data clustering algorithm that discovers arbitrary-shaped clusters and eliminates noise using adjustable reachability distance thresholds. Parallelizing OPTICS is considered challenging because the algorithm exhibits a strongly sequential data access order. We present a scalable parallel OPTICS algorithm (POPTICS) designed using graph algorithmic concepts. To break the data access sequentiality, POPTICS exploits the similarities between the OPTICS algorithm and Prim's Minimum Spanning Tree algorithm. Additionally, we use the disjoint-set data structure to achieve high parallelism for distributed cluster extraction. Using high-dimensional datasets containing up to a billion floating point numbers, we show scalable speedups of up to 27.5 for our OpenMP implementation on a 40-core shared-memory machine, and up to 3,008 for our MPI implementation on a 4,096-core distributed-memory machine. We also show that the quality of the results given by POPTICS is comparable to those given by the classical OPTICS algorithm.
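The two ideas named in the abstract, a Prim-style spanning tree and disjoint-set cluster extraction, can be illustrated with a small sequential sketch. This is not the POPTICS implementation: it assumes plain Euclidean distance as the edge weight in place of the OPTICS reachability distance, runs on a single thread, and the names `prim_mst`, `DisjointSet`, and `extract_clusters` are hypothetical, chosen only for this illustration.

```python
import math


def prim_mst(points):
    """Prim's algorithm on the complete graph over `points`.

    Assumption: edge weight is plain Euclidean distance, standing in
    for the reachability distance used by OPTICS.
    Returns a list of MST edges as (weight, u, v).
    """
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge weight into the tree
    parent = [-1] * n       # tree endpoint of that cheapest edge
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the cheapest vertex that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] != -1:
            edges.append((best[u], parent[u], u))
        # Relax the edges from u to every vertex still outside the tree.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


class DisjointSet:
    """Union-find with path halving (no rank heuristic, for brevity)."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def extract_clusters(n, edges, threshold):
    """Merge the endpoints of every tree edge no heavier than `threshold`."""
    ds = DisjointSet(n)
    for weight, u, v in edges:
        if weight <= threshold:
            ds.union(u, v)
    groups = {}
    for i in range(n):
        groups.setdefault(ds.find(i), []).append(i)
    return sorted(groups.values())


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
edges = prim_mst(points)
clusters = extract_clusters(len(points), edges, threshold=2.0)
print(clusters)
```

In this sketch the tree is built once and can be cut at different thresholds without recomputing distances; a point left in a singleton group, such as the last one here, could be treated as noise.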
Index Terms
- Scalable parallel OPTICS data clustering using graph algorithmic techniques