PUG: a framework and practical implementation for why and why-not provenance

  • Regular Paper
  • The VLDB Journal

Abstract

Explaining why an answer is (or is not) returned by a query is important for many applications, including auditing, debugging data and queries, and answering hypothetical questions about data. In this work, we present the first practical approach for answering such questions for queries with negation (first-order queries). Specifically, we introduce a graph-based provenance model that, while syntactic in nature, supports reverse reasoning and is proven to encode a wide range of provenance models from the literature. The implementation of this model in our PUG (Provenance Unification through Graphs) system takes a provenance question and a Datalog query as input and generates a Datalog program that computes an explanation, i.e., the part of the provenance that is relevant to answering the question. Furthermore, we demonstrate how a desirable factorization of provenance can be achieved by rewriting an input query. We experimentally evaluate our approach, demonstrating its efficiency.
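
To make the idea of a why or why-not question over a query with negation more concrete, here is a minimal sketch in Python. It is not the PUG system, its graph model, or its Datalog rewriting; it simply enumerates variable bindings for one rule and reports which goals succeed or fail. The relation `train`, the rule `only_two_hop`, the data, and the function `explain` are all hypothetical names chosen for illustration.

```python
# Illustrative sketch only: NOT the PUG system or its provenance graph model.
# Hypothetical rule with negation:
#   only_two_hop(X, Y) :- train(X, Z), train(Z, Y), not train(X, Y).

# Hypothetical edge relation (assumed data).
TRAIN = {("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")}
# Active domain: every constant that appears in the data.
DOMAIN = sorted({value for edge in TRAIN for value in edge})


def explain(x, y):
    """Return (is_answer, evidence) for the tuple only_two_hop(x, y).

    If the tuple is an answer, evidence lists the bindings of Z that derive it.
    Otherwise, evidence lists, for each binding of Z, the goals that failed.
    """
    successes, failures = [], []
    for z in DOMAIN:
        # Truth value of each goal of the rule body under the binding Z = z.
        goals = {
            f"train({x},{z})": (x, z) in TRAIN,
            f"train({z},{y})": (z, y) in TRAIN,
            f"not train({x},{y})": (x, y) not in TRAIN,
        }
        failed = [goal for goal, holds in goals.items() if not holds]
        if failed:
            failures.append((z, failed))
        else:
            successes.append(z)
    if successes:
        return True, successes
    return False, failures
```

Under these assumptions, `explain("b", "d")` answers a why question by returning the binding `Z = c` that derives the tuple, while `explain("a", "c")` answers a why-not question: every binding of `Z` fails, and for `Z = b` the only failed goal is the negated one, because a direct connection from `a` to `c` exists.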

Notes

  1. or, equivalently, queries in full relational algebra (without aggregation), formulas in FO logic under the closed-world assumption, and SPJUD-queries (select, project, join, union, difference).

  2. This follows from the semantics of the type of 2-player game used here. The details are beyond the scope of this paper.

  3. We restrict the discussion to sentences only for simplicity. The arguments here also hold for formulas with free variables.

  4. In [34], relational algebra is used to express queries and nodes of f-trees represent equivalence classes of attributes which in Datalog correspond to query variables.

References

  1. Arab, B., Gawlick, D., Radhakrishnan, V., Guo, H., Glavic, B.: A generic provenance middleware for database queries, updates, and transactions. In: TaPP (2014)

  2. Bidoit, N., Herschel, M., Tzompanaki, K.: Immutably answering why-not questions for equivalent conjunctive queries. In: TaPP (2014)

  3. Bidoit, N., Herschel, M., Tzompanaki, K., et al.: Query-based why-not provenance with NedExplain. In: EDBT, pp. 145–156 (2014)

  4. Chapman, A., Jagadish, H.V.: Why not? In: SIGMOD, pp. 523–534 (2009)

  5. Cheney, J., Chiticariu, L., Tan, W.: Provenance in databases: why, how, and where. Found. Trends Databases 1(4), 379–474 (2009)

  6. Damásio, C.V., Analyti, A., Antoniou, G.: Justifications for logic programming. In: Logic Programming and Nonmonotonic Reasoning, pp. 530–542 (2013)

  7. Deutch, D., Gilad, A., Moskovitch, Y.: Selective provenance for datalog programs using top-k queries. PVLDB 8(12), 1394–1405 (2015)

  8. Deutch, D., Milo, T., Roy, S., Tannen, V.: Circuits for datalog provenance. In: ICDT, pp. 201–212 (2014)

  9. Fehrenbach, S., Cheney, J.: Language-integrated provenance. Sci. Comput. Programm. 155, 103–145 (2017)

  10. Flum, J., Kubierschky, M., Ludäscher, B.: Total and partial well-founded datalog coincide. In: ICDT, pp. 113–124 (1997)

  11. Glavic, B., Köhler, S., Riddle, S., Ludäscher, B.: Towards constraint-based explanations for answers and non-answers. In: TaPP (2015)

  12. Glavic, B., Miller, R.J., Alonso, G.: Using SQL for efficient generation and querying of provenance information. In: Tannen, V., Wong, L., Libkin, L., Fan, W., Tan, W.C., Fourman, M. (eds.) In Search of Elegance in the Theory and Practice of Computation, pp. 291–320. Springer, Berlin (2013)

  13. Grädel, E., Tannen, V.: Semiring provenance for first-order model checking (2017). arXiv:1712.01980

  14. Green, T.: Containment of conjunctive queries on annotated relations. Theory Comput. Syst. 49(2), 429–459 (2011)

  15. Green, T., Karvounarakis, G., Tannen, V.: Provenance semirings. In: PODS, pp. 31–40 (2007)

  16. Green, T.J., Aref, M., Karvounarakis, G.: LogicBlox, platform and language: a tutorial. In: Datalog in Academia and Industry, pp. 1–8. Springer, Berlin (2012)

  17. Green, T.J., Karvounarakis, G., Ives, Z.G., Tannen, V.: Update exchange with mappings and provenance. In: VLDB, pp. 675–686 (2007)

  18. Green, T.J., Tannen, V.: The semiring framework for database provenance. In: PODS, pp. 93–99 (2017)

  19. Herschel, M., Diestelkämper, R., Lahmar, H.B.: A survey on provenance: What for? what form? what from? VLDB J 9(3), 1–26 (2017)

  20. Herschel, M., Hernandez, M.: Explaining missing answers to SPJUA queries. PVLDB 3(1), 185–196 (2010)

  21. Huang, J., Chen, T., Doan, A., Naughton, J.: On the provenance of non-answers to queries over extracted data. In: VLDB, pp. 736–747 (2008)

  22. Karvounarakis, G., Green, T.J.: Semiring-annotated data: queries and provenance. SIGMOD Rec. 41(3), 5–14 (2012)

  23. Köhler, S., Ludäscher, B., Smaragdakis, Y.: Declarative datalog debugging for mere mortals. In: Datalog 2.0: Datalog in Academia and Industry, pp. 111–122 (2012)

  24. Köhler, S., Ludäscher, B., Zinn, D.: First-order provenance games. In: Tannen, V., Wong, L., Libkin, L., Fan, W., Tan, W.C., Fourman, M. (eds.) In Search of Elegance in the Theory and Practice of Computation, pp. 382–399. Springer, Berlin (2013)

  25. Lee, S., Köhler, S., Ludäscher, B., Glavic, B.: Efficiently computing provenance graphs for queries with negation. Technical Report CoRR (2016). arXiv:1701.05699

  26. Lee, S., Köhler, S., Ludäscher, B., Glavic, B.: A SQL-middleware unifying why and why-not provenance for first-order queries. In: ICDE, pp. 485–496 (2017)

  27. Lee, S., Ludäscher, B., Glavic, B.: PUG: A framework and practical implementation for why and why-not provenance (extended version). Technical Report CoRR (2018). arXiv:1808.05752

  28. Lee, S., Niu, X., Ludäscher, B., Glavic, B.: Integrating approximate summarization with provenance capture. In: TaPP (2017)

  29. Meliou, A., Gatterbauer, W., Moore, K., Suciu, D.: The complexity of causality and responsibility for query answers and non-answers. PVLDB 4(1), 34–45 (2010)

  30. Meliou, A., Gatterbauer, W., Suciu, D.: Reverse data management. PVLDB 4(12), 1490–1493 (2011)

  31. Meliou, A., Suciu, D.: Tiresias: The database oracle for how-to queries. In: SIGMOD, pp. 337–348 (2012)

  32. Niu, X., Kapoor, R., Glavic, B., Gawlick, D., Liu, Z.H., Krishnaswamy, V., Radhakrishnan, V.: Provenance-aware query optimization. In: ICDE, pp. 473–484 (2017)

  33. Olteanu, D., Závodnỳ, J.: Factorised representations of query results: size bounds and readability. In: ICDT, pp. 285–298. ACM (2012)

  34. Olteanu, D., Závodnỳ, J.: Size bounds for factorised representations of query results. ACM Trans. Database Syst. (TODS) 40(1), 2 (2015)

  35. Riddle, S., Köhler, S., Ludäscher, B.: Towards constraint provenance games. In: TaPP (2014)

  36. Roy, S., Orr, L., Suciu, D.: Explaining query answers with explanation-ready databases. Proc. VLDB Endow. 9(4), 348–359 (2015)

  37. Roy, S., Suciu, D.: A formal approach to finding explanations for database queries. In: SIGMOD (2014)

  38. Senellart, P.: Provenance and probabilities in relational databases. ACM SIGMOD Rec. 46(4), 5–15 (2018)

  39. Tannen, V.: Provenance analysis for FOL model checking. ACM SIGLOG News 4(1), 24–36 (2017)

  40. Tran, Q.T., Chan, C.-Y.: How to conquer why-not questions. In: SIGMOD, pp. 15–26 (2010)

  41. Wu, E., Madden, S.: Scorpion: explaining away outliers in aggregate queries. PVLDB 6(8), 553–564 (2013)

  42. Wu, Y., Zhao, M., Haeberlen, A., Zhou, W., Loo, B.T.: Diagnosing missing events in distributed systems with negative provenance. In: SIGCOMM, pp. 383–394 (2014)

  43. Xu, J., Zhang, W., Alawini, A., Tannen, V.: Provenance analysis for missing answers and integrity repairs. IEEE Data Eng. Bull. 41(1), 39–50 (2018)

  44. Zhou, W., Sherr, M., Tao, T., Li, X., Loo, B.T., Mao, Y.: Efficient querying and maintenance of network provenance at internet-scale. In: SIGMOD, pp. 615–626 (2010)

Acknowledgements

This work was supported by NSF Awards OAC-{1640864, 1541450} and SMA-1637155. Opinions and findings expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

Author information

Corresponding author

Correspondence to Seokki Lee.

About this article

Cite this article

Lee, S., Ludäscher, B. & Glavic, B. PUG: a framework and practical implementation for why and why-not provenance. The VLDB Journal 28, 47–71 (2019). https://doi.org/10.1007/s00778-018-0518-5

  • DOI: https://doi.org/10.1007/s00778-018-0518-5
