Query Rewriting for Heterogeneous Data Lakes

Hai, Rihan; Quix, Christoph; Zhou, Chen

doi:10.1007/978-3-319-98398-1_3

Rihan Hai¹⁶, Christoph Quix¹⁶,¹⁷ & Chen Zhou¹⁶

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11019)

Included in the following conference series:

European Conference on Advances in Databases and Information Systems


Abstract

The increasing popularity of NoSQL systems has led to the model of polyglot persistence, in which several data management systems with different data models are used. Data lakes realize the polyglot persistence model by collecting data from various sources, by storing the data in its original structure, and by providing the datasets for querying and analysis. Thus, one of the key tasks of data lakes is to provide a unified querying interface that can rewrite queries expressed in a general data model into a union of queries for data sources spanning heterogeneous data stores. To address this challenge, we propose a novel framework for query rewriting that combines logical methods for data integration based on declarative mappings with a scalable big data query processing system (i.e., Apache Spark) to efficiently execute the rewritten queries and to reconcile the query results into an integrated dataset. Because of the diversity of NoSQL systems, our approach is based on a flexible and extensible architecture that currently supports the major data structures such as relational data, semi-structured data (e.g., JSON, XML), and graphs. We show the applicability of our query rewriting engine with six real-world datasets and demonstrate its scalability using an artificial data integration scenario with multiple storage systems.
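The idea of rewriting one query over an integrated schema into a union of per-source queries can be illustrated with a minimal sketch. It is not the framework described in the paper: it uses plain Python lists in place of real data stores and Apache Spark, and every name in it (Mapping, rewrite, execute, the sample records) is a hypothetical chosen for illustration. It assumes a query that is only a projection over integrated attribute names and mappings that are simple attribute renamings.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, object]


@dataclass
class Mapping:
    """Hypothetical declarative mapping from the integrated schema to one source."""
    source: str                            # label of the store, e.g. "relational"
    attr_map: Dict[str, str]               # integrated attribute -> source field name
    fetch: Callable[[], Iterable[Record]]  # stand-in for a store-specific scan


def rewrite(attrs: List[str], mappings: List[Mapping]) -> List[Tuple[Mapping, List[str]]]:
    """Turn a projection over integrated attributes into one sub-query per source.

    A source contributes only if its mapping covers every requested attribute.
    """
    plans = []
    for m in mappings:
        if all(a in m.attr_map for a in attrs):
            plans.append((m, [m.attr_map[a] for a in attrs]))
    return plans


def execute(attrs: List[str], plans: List[Tuple[Mapping, List[str]]]) -> List[Record]:
    """Run each sub-query and union the results under the integrated names."""
    result = []
    for m, fields in plans:
        for rec in m.fetch():
            result.append({a: rec[f] for a, f in zip(attrs, fields)})
    return result


# Toy sources with different field names (assumed data, not from the paper).
relational_rows = [{"title": "Paper A", "yr": 2016}]
json_docs = [{"name": "Paper B", "year": 2015}]
graph_nodes = [{"label": "Paper C"}]  # no year field, so it cannot answer the query

mappings = [
    Mapping("relational", {"title": "title", "year": "yr"}, lambda: relational_rows),
    Mapping("json", {"title": "name", "year": "year"}, lambda: json_docs),
    Mapping("graph", {"title": "label"}, lambda: graph_nodes),
]

query = ["title", "year"]
plans = rewrite(query, mappings)
integrated = execute(query, plans)
print(integrated)
```

In this sketch the graph source is skipped because its mapping does not cover the requested year attribute. A real system would additionally have to push down selections and joins and handle nested structures, which this example leaves out.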

Notes

1. https://docs.mongodb.com/spark-connector/v1.1/java-api/
2. https://github.com/neo4j-contrib/neo4j-spark-connector
3. https://github.com/databricks/spark-xml
4. http://dblp.org/
5. https://europepmc.org/
6. https://www.drugbank.ca/
7. http://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/
8. http://jsonstudio.com/resources/
9. Queries and datasets are available at https://bit.ly/2l9lXhc.

Acknowledgements

This work has been partially funded by the German Federal Ministry of Education and Research (BMBF) (project HUMIT, http://humit.de/, grant no. 01IS14007A), by the German Research Foundation (DFG) within the Cluster of Excellence “Integrative Production Technology for High Wage Countries” (EXC 128), and by the Joint Research (IGF) of the German Federal Ministry of Economic Affairs and Energy (BMWI, project charMant, http://charmant-projekt.de/, IGF promotion plan 18504N).

Author information

Authors and Affiliations

Databases and Information Systems, RWTH Aachen University, Aachen, Germany
Rihan Hai, Christoph Quix & Chen Zhou
Fraunhofer-Institute for Applied Information Technology FIT, Sankt Augustin, Germany
Christoph Quix

Authors

Rihan Hai
Christoph Quix
Chen Zhou

Corresponding author

Correspondence to Rihan Hai.

Editor information

Editors and Affiliations

Eötvös Loránd University, Budapest, Hungary
András Benczúr
Christian-Albrechts-Universität, Kiel, Germany
Bernhard Thalheim
Eötvös Loránd University, Budapest, Hungary
Tomáš Horváth

About this paper

Cite this paper

Hai, R., Quix, C., Zhou, C. (2018). Query Rewriting for Heterogeneous Data Lakes. In: Benczúr, A., Thalheim, B., Horváth, T. (eds) Advances in Databases and Information Systems. ADBIS 2018. Lecture Notes in Computer Science, vol 11019. Springer, Cham. https://doi.org/10.1007/978-3-319-98398-1_3

DOI: https://doi.org/10.1007/978-3-319-98398-1_3
Published: 29 July 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-98397-4
Online ISBN: 978-3-319-98398-1
eBook Packages: Computer Science, Computer Science (R0)
