Efficient provenance tracking for datalog using top-k queries

Deutch, Daniel; Gilad, Amir; Moskovitch, Yuval

doi:10.1007/s00778-018-0496-7

Regular Paper
Published: 22 February 2018

The VLDB Journal, Volume 27, pages 245–269 (2018)

Abstract

Highly expressive declarative languages, such as datalog, are now commonly used to model the operational logic of data-intensive applications. The typical complexity of such datalog programs, and the large volume of data that they process, call for result explanation. Results may be explained through the tracking and presentation of data provenance, defined here as the set of derivation trees of a given fact. While informative, the size of such full provenance information is typically too large and complex (even when compactly represented) to allow displaying it to the user. To this end, we propose a novel top-k query language for querying datalog provenance, supporting selection criteria based on tree patterns and ranking based on the rules and database facts used in derivation. We propose an efficient novel algorithm that computes in polynomial data complexity a compact representation of the top-k trees which may be explicitly constructed in linear time with respect to their size. We further experimentally study the algorithm performance, showing its scalability even for complex datalog programs where full provenance tracking is infeasible.
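To make the notions of derivation tree and top-k ranking concrete, the following is a minimal sketch that is not the algorithm proposed in the article. It naively enumerates the derivation trees of a fact for a two-rule reachability program and keeps the k heaviest ones. The facts, the rule names r1 and r2, all weights, the depth bound, and the choice of multiplying weights to score a tree are assumptions made only for illustration. The article's algorithm instead computes a compact representation in polynomial data complexity and avoids this kind of exhaustive enumeration.

```python
import heapq

# Hypothetical weighted database facts: edge(x, y) -> weight in (0, 1].
EDGES = {("a", "b"): 0.9, ("b", "c"): 0.8, ("a", "c"): 0.5, ("b", "a"): 0.7}

# Hypothetical rule weights for the program
#   r1: reach(X, Y) :- edge(X, Y).
#   r2: reach(X, Y) :- edge(X, Z), reach(Z, Y).
RULE_WEIGHTS = {"r1": 1.0, "r2": 0.9}


def derivations(src, dst, max_depth):
    """Yield (weight, tree) for every derivation tree of reach(src, dst)
    whose chain of nested rule applications has length at most max_depth.

    The depth bound keeps the enumeration finite: the edge relation above
    contains a cycle, so without a bound there are infinitely many trees.
    """
    if max_depth == 0:
        return
    # r1: a single edge fact derives reach(src, dst) directly.
    if (src, dst) in EDGES:
        weight = RULE_WEIGHTS["r1"] * EDGES[(src, dst)]
        yield weight, ("r1", f"edge({src},{dst})")
    # r2: follow one edge out of src, then derive reach(z, dst) recursively.
    for (x, z), edge_weight in EDGES.items():
        if x != src:
            continue
        for sub_weight, sub_tree in derivations(z, dst, max_depth - 1):
            weight = RULE_WEIGHTS["r2"] * edge_weight * sub_weight
            yield weight, ("r2", f"edge({x},{z})", sub_tree)


def top_k(src, dst, k, max_depth=4):
    """Return the k heaviest derivation trees of reach(src, dst)."""
    return heapq.nlargest(k, derivations(src, dst, max_depth),
                          key=lambda pair: pair[0])


if __name__ == "__main__":
    for weight, tree in top_k("a", "c", k=2):
        print(round(weight, 4), tree)
```

With these assumed weights, the heaviest derivation of reach(a, c) goes through b (one application of r2 followed by r1), and the second heaviest uses the direct edge(a, c) fact. Even in this tiny instance the number of trees grows with the depth bound, which illustrates why full provenance is impractical to present and why a selective top-k view is useful.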

Notes

This requires a slight change of the definition of patterns, which is easy to support, to allow * in relation names.

Acknowledgements

This research has been partially funded by the Israeli Science Foundation (978/17, 1636/13) and the Blavatnik Interdisciplinary Cyber Research Center (TAU ICRC). The contribution of Yuval Moskovitch is part of Ph.D. thesis research conducted at Tel Aviv University.

Author information

Authors and Affiliations

Tel Aviv University, Tel Aviv, Israel
Daniel Deutch, Amir Gilad & Yuval Moskovitch

Corresponding author

Correspondence to Amir Gilad.

About this article

Cite this article

Deutch, D., Gilad, A. & Moskovitch, Y. Efficient provenance tracking for datalog using top-k queries. The VLDB Journal 27, 245–269 (2018). https://doi.org/10.1007/s00778-018-0496-7

Received: 21 December 2016
Revised: 15 November 2017
Accepted: 31 January 2018
Published: 22 February 2018
Issue Date: April 2018
DOI: https://doi.org/10.1007/s00778-018-0496-7
