DOI: 10.1145/2949689.2949705
research-article

Multi-Assignment Single Joins for Parallel Cross-Match of Astronomic Catalogs on Heterogeneous Clusters

Published: 18 July 2016

ABSTRACT

Cross-match is a central operation in astronomic databases: it integrates multiple catalogs of celestial objects. With the rapid development of new astronomy projects, large numbers of astronomic catalogs are being generated, and they require fast cross-match against existing databases. In this paper, we propose adopting a Multi-Assignment Single Join (MASJ) method for cross-match on heterogeneous clusters that consist of both CPUs and GPUs. We chose MASJ for cross-match because (1) cross-matching records from astronomic catalogs is essentially a spatial distance join on two sets of points, and (2) each reference point is mapped to only a small number of search intervals. As a result, the MASJ cross-match algorithm, or MASJ-CM, is feasible and highly efficient in a heterogeneous cluster environment. We have implemented MASJ-CM in two packages: one is an MPI-CUDA implementation, which fully utilizes the multi-core CPUs, GPUs, and InfiniBand communications; the other is built on top of the popular distributed computing platform Spark, which greatly simplifies the programming. Our results on a six-node CPU-GPU cluster show that the MPI-CUDA implementation achieved a speedup of 2.69 times over a previous indexed nested-loop join algorithm. The Spark-based implementation was an order of magnitude slower than the MPI-CUDA implementation; nevertheless, it is widely applicable and its source code is much simpler.
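The general idea of a multi-assignment single join can be illustrated with a minimal sketch. This is not the MASJ-CM implementation: it assumes a flat 2D plane with Euclidean distance and a uniform square grid instead of spherical coordinates and sky-pixel search intervals, and the names `cell_of`, `masj_cross_match`, `radius`, and `cell` are hypothetical. Each reference point is replicated into every grid cell that its search square overlaps (multi-assignment), while each target point belongs to exactly one cell and is joined only there (single join), so no duplicate pairs arise.

```python
from collections import defaultdict
from math import floor, hypot


def cell_of(x, y, cell):
    """Return the grid cell (column, row) that contains point (x, y)."""
    return (floor(x / cell), floor(y / cell))


def masj_cross_match(reference, targets, radius, cell=1.0):
    """Return sorted (ref_index, target_index) pairs within `radius`.

    Illustrative only: planar coordinates and a uniform grid are
    assumptions made for this sketch.
    """
    # Multi-assignment: replicate each reference point into every cell
    # overlapped by its search square [x - r, x + r] x [y - r, y + r].
    buckets = defaultdict(list)
    for i, (x, y) in enumerate(reference):
        cx0, cy0 = cell_of(x - radius, y - radius, cell)
        cx1, cy1 = cell_of(x + radius, y + radius, cell)
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                buckets[(cx, cy)].append(i)

    # Single join: each target point falls in exactly one cell and is
    # compared only with the reference points replicated into that cell.
    matches = []
    for j, (x, y) in enumerate(targets):
        for i in buckets.get(cell_of(x, y, cell), ()):
            rx, ry = reference[i]
            if hypot(rx - x, ry - y) <= radius:
                matches.append((i, j))
    return sorted(matches)


if __name__ == "__main__":
    reference = [(0.95, 0.5), (3.2, 3.2)]
    targets = [(1.05, 0.5), (0.5, 0.5), (3.25, 3.2), (5.0, 5.0)]
    print(masj_cross_match(reference, targets, radius=0.2))
```

In this sketch the first reference point sits near a cell boundary, so it is assigned to two cells and can still be matched with a target that lies just across the boundary. Because the per-cell joins are independent, each cell could in principle be processed by a different worker.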

  • Published in

    SSDBM '16: Proceedings of the 28th International Conference on Scientific and Statistical Database Management
    July 2016
    290 pages

    Copyright © 2016 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • research-article
    • Research
    • Refereed limited

    Acceptance Rates

    Overall Acceptance Rate: 56 of 146 submissions, 38%
