ABSTRACT
Cross-match is a central operation in astronomical databases for integrating multiple catalogs of celestial objects. With the rapid development of new astronomy projects, large numbers of astronomical catalogs are being generated and require fast cross-match against existing databases. In this paper, we propose adopting a Multi-Assignment Single Join (MASJ) method for cross-match on heterogeneous clusters that consist of both CPUs and GPUs. We chose MASJ for cross-match because (1) cross-matching records from astronomical catalogs is essentially a spatial distance join on two sets of points, and (2) each reference point is mapped to only a small number of search intervals. As a result, the MASJ cross-match algorithm, or MASJ-CM, is feasible and highly efficient in a heterogeneous cluster environment. We have implemented MASJ-CM in two packages: one is an MPI-CUDA implementation, which fully utilizes the multi-core CPUs, GPUs, and InfiniBand communications; the other is built on top of the popular distributed computing platform Spark, which greatly simplifies the programming. Our results on a six-node CPU-GPU cluster show that the MPI-CUDA implementation achieved a speedup of 2.69 times over a previous indexed nested-loop join algorithm. The Spark-based implementation was an order of magnitude slower than the MPI-CUDA implementation; nevertheless, it is widely applicable and its source code is much simpler.
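To make the multi-assignment single-join idea concrete, the following is a minimal single-machine sketch in Python, not the MPI-CUDA or Spark implementation described above. It rests on several assumptions that are not taken from the paper: a uniform RA/Dec grid stands in for the search intervals, the names `masj_cross_match`, `cell_of`, and `angular_sep_deg` are hypothetical, and points are assumed to lie away from the poles and from the RA 0/360 wrap-around. Each reference point is replicated into every cell that its search region touches (multi-assignment), while each sample point falls into exactly one cell, so every matching pair is found once and no duplicate elimination is needed (single join).

```python
import math
from collections import defaultdict


def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees, using the haversine formula."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cell_of(ra, dec, cell_deg):
    """Grid cell (hypothetical stand-in for a sky pixel) containing a point."""
    return (math.floor(ra / cell_deg), math.floor(dec / cell_deg))


def masj_cross_match(reference, sample, radius_deg, cell_deg):
    """Return (reference_index, sample_index) pairs within radius_deg.

    Assumes no point lies near a pole or the RA wrap-around.
    """
    buckets = defaultdict(list)

    # Multi-assignment: copy each reference point into every cell that
    # overlaps the bounding box of its search circle.
    for i, (ra, dec) in enumerate(reference):
        # Half-width of the circle in RA grows with declination.
        ratio = math.sin(math.radians(radius_deg)) / math.cos(math.radians(dec))
        dra = math.degrees(math.asin(min(1.0, ratio)))
        x_lo, y_lo = cell_of(ra - dra, dec - radius_deg, cell_deg)
        x_hi, y_hi = cell_of(ra + dra, dec + radius_deg, cell_deg)
        for x in range(x_lo, x_hi + 1):
            for y in range(y_lo, y_hi + 1):
                buckets[(x, y)].append(i)

    # Single join: each sample point probes only its own cell.
    matches = []
    for j, (ra, dec) in enumerate(sample):
        for i in buckets.get(cell_of(ra, dec, cell_deg), ()):
            ref_ra, ref_dec = reference[i]
            if angular_sep_deg(ref_ra, ref_dec, ra, dec) <= radius_deg:
                matches.append((i, j))
    return matches
```

In a distributed setting, the cells would be the unit of partitioning, so that each node joins only the reference copies and sample points assigned to its cells. The replication cost stays small as long as the search radius is small relative to the cell size, which corresponds to reason (2) in the abstract.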