DOI: 10.1145/2503210.2503255
research-article

Scalable parallel OPTICS data clustering using graph algorithmic techniques

Published: 17 November 2013

ABSTRACT

OPTICS is a hierarchical density-based data clustering algorithm that discovers arbitrary-shaped clusters and eliminates noise using adjustable reachability distance thresholds. Parallelizing OPTICS is considered challenging because the algorithm exhibits a strongly sequential data access order. We present a scalable parallel OPTICS algorithm (POPTICS) designed using graph algorithmic concepts. To break the data access sequentiality, POPTICS exploits the similarities between the OPTICS algorithm and Prim's Minimum Spanning Tree algorithm. Additionally, we use the disjoint-set data structure to achieve high parallelism for distributed cluster extraction. Using high-dimensional datasets containing up to a billion floating point numbers, we show scalable speedups of up to 27.5 for our OpenMP implementation on a 40-core shared-memory machine, and up to 3,008 for our MPI implementation on a 4,096-core distributed-memory machine. We also show that the quality of the results given by POPTICS is comparable to that of the classical OPTICS algorithm.
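
The two graph ideas named in the abstract, a Prim-style spanning tree and disjoint-set cluster extraction, can be illustrated with a small sequential sketch. This is not the authors' POPTICS implementation. It assumes plain Euclidean distance as the edge weight in place of the OPTICS reachability distance, and the names `prim_mst`, `DisjointSet`, `extract_clusters`, and the `threshold` parameter are hypothetical, chosen only for illustration.

```python
import math


def prim_mst(points):
    """Prim's algorithm on the complete Euclidean graph over `points`.

    Returns the MST as a list of (weight, u, v) edges. Euclidean distance
    is an assumed stand-in for the OPTICS reachability distance.
    """
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge weight linking each vertex to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the cheapest vertex that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] != -1:
            edges.append((best[u], parent[u], u))
        # Relax the edges from u to every vertex still outside the tree.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges


class DisjointSet:
    """Minimal union-find with path halving (hypothetical helper)."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def extract_clusters(n, edges, threshold):
    """Merge the endpoints of every tree edge no heavier than `threshold`."""
    ds = DisjointSet(n)
    for w, u, v in edges:
        if w <= threshold:
            ds.union(u, v)
    return [ds.find(i) for i in range(n)]


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (50.0, 50.0)]
edges = prim_mst(points)
labels = extract_clusters(len(points), edges, threshold=2.0)
```

With this toy input the first three points share one label, the next two share another, and the last point stays alone; a density-based method would typically treat such an isolated point as noise. Because the unions in `extract_clusters` can be applied in any order, the extraction step does not depend on the sequential order in which the tree was built, which is one plausible reading of why a disjoint-set structure suits parallel extraction.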


Published in

SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
November 2013
1123 pages
ISBN: 9781450323789
DOI: 10.1145/2503210
General Chair: William Gropp
Program Chair: Satoshi Matsuoka

              Copyright © 2013 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


              Acceptance Rates

SC '13 paper acceptance rate: 91 of 449 submissions (20%). Overall acceptance rate: 1,516 of 6,373 submissions (24%).
