DOI: 10.1145/3274895.3274942
research-article

Efficient astronomical query processing using Spark

Published: 06 November 2018

Abstract

Sky surveys are a fundamental data source in astronomy. Today, modern telescopes are pushing these surveys into the petascale regime. Because of this exponential growth in astronomical data, there is a pressing need for efficient astronomical query processing. Our goal is to bridge the gap between existing distributed systems and the high-level languages used by astronomers. In this paper, we present efficient techniques for processing queries over astronomical data using ASTROIDE. Our framework helps astronomers take advantage of the richness of astronomical data. The proposed model supports complex astronomical operators expressed in ADQL (Astronomical Data Query Language), an extension of SQL commonly used by astronomers. ASTROIDE proposes spatial indexing and partitioning techniques that limit the data each query has to access. It also implements a query optimizer that injects spatial-aware optimization rules and strategies. An experimental evaluation on real datasets demonstrates that the framework is scalable and efficient.
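
To make the two main ideas of the abstract concrete, the following is a minimal Python sketch of a cone search (a typical ADQL-style operator) evaluated over spatially partitioned data. It is not ASTROIDE's implementation: the abstract does not describe the indexing scheme, so fixed declination bands stand in for it here, and the names `BAND_DEG`, `band_of`, `partition`, and `cone_search`, the table name `catalog`, and the sample rows are all hypothetical. What it illustrates is that a spatial-aware plan can skip whole partitions before applying the exact distance test.

```python
import math

# An ADQL cone search of the kind such a system would accept might look like
# this (table and column names are assumptions made for this sketch):
#   SELECT * FROM catalog
#   WHERE 1 = CONTAINS(POINT('ICRS', ra, dec), CIRCLE('ICRS', 10.0, 20.0, 1.0))

BAND_DEG = 10.0  # hypothetical partition height in declination, in degrees


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees, using the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def band_of(dec):
    """Partition id for a declination; band 0 covers [-90, -80), and so on."""
    last_band = int(180.0 / BAND_DEG) - 1
    return min(int((dec + 90.0) // BAND_DEG), last_band)


def partition(rows):
    """Group (id, ra, dec) rows by declination band."""
    parts = {}
    for row in rows:
        parts.setdefault(band_of(row[2]), []).append(row)
    return parts


def cone_search(parts, ra0, dec0, radius):
    """Return (sorted ids within radius, number of rows actually tested).

    A source within `radius` of the centre cannot differ from it by more than
    `radius` in declination, so only the bands overlapping
    [dec0 - radius, dec0 + radius] need to be scanned.
    """
    lo = band_of(max(-90.0, dec0 - radius))
    hi = band_of(min(90.0, dec0 + radius))
    hits, scanned = [], 0
    for band in range(lo, hi + 1):
        for obj_id, ra, dec in parts.get(band, []):
            scanned += 1
            if angular_separation_deg(ra, dec, ra0, dec0) <= radius:
                hits.append(obj_id)
    return sorted(hits), scanned


# Toy catalogue of (id, ra, dec) rows in degrees; the values are made up.
CATALOG = [
    ("a", 10.0, 20.0),
    ("b", 10.5, 20.2),
    ("c", 200.0, 20.1),
    ("d", 10.0, -45.0),
    ("e", 11.5, 21.5),
]

if __name__ == "__main__":
    hits, scanned = cone_search(partition(CATALOG), 10.0, 20.0, 1.0)
    print(hits, scanned)  # ['a', 'b'] 4
```

In this toy catalogue, row `d` lies in a band the query never opens, so only four of the five rows are tested against the exact predicate. A real system would use a finer spherical index and would apply the pruning inside the query optimizer rather than in application code.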


Published In

SIGSPATIAL '18: Proceedings of the 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems
November 2018
655 pages
ISBN: 9781450358897
DOI: 10.1145/3274895
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. astronomical survey data management
  2. big data
  3. query processing
  4. spark framework

Qualifiers

  • Research-article

Funding Sources

  • European Union

Conference

SIGSPATIAL '18

Acceptance Rates

SIGSPATIAL '18 Paper Acceptance Rate 30 of 150 submissions, 20%
Overall Acceptance Rate 257 of 1,238 submissions, 21%

Article Metrics

  • Downloads (last 12 months): 13
  • Downloads (last 6 weeks): 0
Reflects downloads up to 17 Jan 2025

Cited By

  • (2023) Persistent and occasional: Searching for the variable population of the ZTF/4MOST sky using ZTF Data Release 11. Astronomy & Astrophysics 675 (A195). DOI: 10.1051/0004-6361/202346077. Online publication date: 20-Jul-2023
  • (2021) The Automatic Learning for the Rapid Classification of Events (ALeRCE) Alert Broker. The Astronomical Journal 161:5 (242). DOI: 10.3847/1538-3881/abe9bc. Online publication date: 27-Apr-2021
  • (2021) Scaling pair count to next galaxy surveys. Monthly Notices of the Royal Astronomical Society 510:2 (3085-3097). DOI: 10.1093/mnras/stab3640. Online publication date: 15-Dec-2021
  • (2020) Prospective Data Model and Distributed Query Processing for Mobile Sensing Data Streams. Multiple-Aspect Analysis of Semantic Trajectories (66-82). DOI: 10.1007/978-3-030-38081-6_6. Online publication date: 4-Jan-2020
  • (2019) Analysing billion-objects catalogue interactively: Apache Spark for physicists. Astronomy and Computing (100305). DOI: 10.1016/j.ascom.2019.100305. Online publication date: Jul-2019
