Abstract
The increasing popularity of NoSQL systems has led to the model of polyglot persistence, in which several data management systems with different data models are used. Data lakes realize the polyglot persistence model by collecting data from various sources, by storing the data in its original structure, and by providing the datasets for querying and analysis. Thus, one of the key tasks of data lakes is to provide a unified querying interface, which is able to rewrite queries expressed in a general data model into a union of queries for data sources spanning heterogeneous data stores. To address this challenge, we propose a novel framework for query rewriting that combines logical methods for data integration based on declarative mappings with a scalable big data query processing system (i.e., Apache Spark) to efficiently execute the rewritten queries and to reconcile the query results into an integrated dataset. Because of the diversity of NoSQL systems, our approach is based on a flexible and extensible architecture that currently supports the major data structures such as relational data, semi-structured data (e.g., JSON, XML), and graphs. We show the applicability of our query rewriting engine with six real-world datasets and demonstrate its scalability using an artificial data integration scenario with multiple storage systems.
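The rewriting idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' framework and does not use Apache Spark; it only shows, in plain Python, how a query over a general schema might be rewritten into one sub-query per source through declarative mappings, with the results unioned and deduplicated. All names here (SourceMapping, rewrite, execute, the store names, and the sample records) are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple

Path = Tuple[str, ...]  # sequence of keys leading to a value in a source record


@dataclass
class SourceMapping:
    """Hypothetical declarative mapping: global attribute -> path in one source."""
    source: str
    paths: Dict[str, Path]


def get_path(record: Any, path: Path) -> Any:
    """Follow a key path into a (possibly nested) record."""
    for key in path:
        record = record[key]
    return record


def rewrite(select: List[str], where: Dict[str, Any],
            mappings: List[SourceMapping]) -> List[dict]:
    """Rewrite a global query into one sub-query per source that can answer it."""
    needed = set(select) | set(where)
    subqueries = []
    for m in mappings:
        if needed <= set(m.paths):  # skip sources that lack a required attribute
            subqueries.append({
                "source": m.source,
                "select": [m.paths[a] for a in select],
                "where": [(m.paths[a], v) for a, v in where.items()],
            })
    return subqueries


def execute(subqueries: List[dict], stores: Dict[str, List[dict]]) -> List[tuple]:
    """Run each sub-query on its store and reconcile the results as a set union."""
    result = set()
    for q in subqueries:
        for rec in stores[q["source"]]:
            if all(get_path(rec, p) == v for p, v in q["where"]):
                result.add(tuple(get_path(rec, p) for p in q["select"]))
    return sorted(result)


# Toy stand-ins for a relational table, a JSON document store, and graph nodes.
STORES = {
    "rel": [{"pname": "Ada", "town": "London"}, {"pname": "Carl", "town": "Aachen"}],
    "docs": [{"name": "Bob", "address": {"city": "London"}},
             {"name": "Ada", "address": {"city": "London"}}],
    "graph": [{"label": "Dana"}],
}

MAPPINGS = [
    SourceMapping("rel", {"name": ("pname",), "city": ("town",)}),
    SourceMapping("docs", {"name": ("name",), "city": ("address", "city")}),
    SourceMapping("graph", {"name": ("label",)}),  # no city attribute available
]

if __name__ == "__main__":
    subs = rewrite(["name"], {"city": "London"}, MAPPINGS)
    print(execute(subs, STORES))
```

In this sketch the graph source is left out of the rewriting for the filtered query because its mapping offers no city attribute, and the duplicate record for Ada collapses during the union. A real system would push each sub-query down to the store's native query language rather than scanning records in memory.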
Notes
Queries and datasets are available at https://bit.ly/2l9lXhc.
Acknowledgements
This work has been partially funded by the German Federal Ministry of Education and Research (BMBF) (project HUMIT, http://humit.de/, grant no. 01IS14007A), by the German Research Foundation (DFG) within the Cluster of Excellence “Integrative Production Technology for High Wage Countries” (EXC 128), and by the Joint Research (IGF) of the German Federal Ministry of Economic Affairs and Energy (BMWI, project charMant, http://charmant-projekt.de/, IGF promotion plan 18504N).
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Hai, R., Quix, C., Zhou, C. (2018). Query Rewriting for Heterogeneous Data Lakes. In: Benczúr, A., Thalheim, B., Horváth, T. (eds) Advances in Databases and Information Systems. ADBIS 2018. Lecture Notes in Computer Science, vol 11019. Springer, Cham. https://doi.org/10.1007/978-3-319-98398-1_3
DOI: https://doi.org/10.1007/978-3-319-98398-1_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-98397-4
Online ISBN: 978-3-319-98398-1
eBook Packages: Computer Science, Computer Science (R0)