DOI: 10.1145/3530800.3534530
research-article

Towards practical approximate lineage

Published: 12 June 2022

Abstract

Traditionally, provenance and lineage mainly referred to query results. We take a more holistic approach. We consider a system in which tuples (records) that are produced by a query may affect other tuple insertions into the DB, as part of a normal workflow. Therefore, we consider both direct lineage (dependence of a query result on database tuples directly used in solving the query) and distant lineage (dependence on older tuples that caused the existence of the tuples directly used in solving the query). We aim to formulate practical methods for supporting direct and distant lineage.
We use a novel genetics-inspired approach for approximating lineage tracking. It relies on word embedding to endow each explicitly inserted tuple with a small set of vectors that "encode" its content, and on an algebra over such sets of vectors that derives a set of vectors encoding the lineage of a query-inserted tuple. During the execution of a query, we construct the lineage vectors of the final (and intermediate) result tuples in a fashion similar to that of semiring-based exact provenance calculations. We extend the + and · operations to generate sets of lineage vectors, while retaining the ability to propagate information and preserve the compact representation. Therefore, our solution does not suffer from space complexity blow-up over time, and it "naturally ranks" explanations for the existence of a tuple.
We introduce several fundamental improvements and extensions to the basic method of [19]. The aim is to make the basic scheme practical by taking advantage of additional knowledge regarding timing, query structure, and relative importance. The improvements include tuple creation timestamps and a temporal sequence of search structures, column emphasis, and a query dependency DAG. We integrate our lineage computations with these improvements into the PostgreSQL system via an extension (ProvSQL) and perform extensive experiments. The experiments exhibit useful results in terms of accuracy against exact, semiring-based tuple justifications (the gold standard), especially for the column-based (CV) method, which exhibits high precision and high per-level recall. We focus on target tuples with multiple generations of tuples in their lineage, namely those having a distant lineage, and analyze them in terms of generational lineage accuracy.
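
The set-of-vectors algebra outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the cap `MAX_VECTORS`, the helper names (`compact`, `plus`, `times`, `similarity`), the use of pairwise averaging for ·, and the greedy merge used to keep sets small are all assumptions made for illustration. The abstract only states that + and · are extended to produce compact sets of lineage vectors.

```python
from itertools import product
from math import sqrt
from typing import List, Tuple

Vector = Tuple[float, ...]
MAX_VECTORS = 2  # hypothetical cap on the number of lineage vectors per tuple


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def average(u: Vector, v: Vector) -> Vector:
    """Element-wise mean of two vectors of equal length."""
    return tuple((a + b) / 2 for a, b in zip(u, v))


def compact(vectors: List[Vector], cap: int = MAX_VECTORS) -> List[Vector]:
    """Assumed compaction step: merge the two most similar vectors until <= cap remain."""
    vs = list(dict.fromkeys(vectors))  # drop exact duplicates, keep order
    while len(vs) > cap:
        pairs = ((i, j) for i in range(len(vs)) for j in range(i + 1, len(vs)))
        i, j = max(pairs, key=lambda p: cosine(vs[p[0]], vs[p[1]]))
        merged = average(vs[i], vs[j])
        vs = [v for k, v in enumerate(vs) if k not in (i, j)] + [merged]
    return vs


def plus(a: List[Vector], b: List[Vector], cap: int = MAX_VECTORS) -> List[Vector]:
    """Analogue of semiring +: alternative derivations, modeled as a capped union."""
    return compact(a + b, cap)


def times(a: List[Vector], b: List[Vector], cap: int = MAX_VECTORS) -> List[Vector]:
    """Analogue of semiring ·: joint use, modeled as capped pairwise averages."""
    return compact([average(u, v) for u, v in product(a, b)], cap)


def similarity(lineage: List[Vector], candidate: List[Vector]) -> float:
    """Score a candidate tuple against a lineage set by best pairwise cosine."""
    return max(cosine(u, v) for u in lineage for v in candidate)
```

Under these assumptions, a result tuple derived by joining two base tuples carries the output of `times`, and ranking candidate tuples by `similarity` against that set gives an approximate, ordered list of lineage candidates whose size does not grow with the derivation history.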

References

[1]
Eleanor Ainy, Pierre Bourhis, Susan B. Davidson, Daniel Deutch, and Tova Milo. 2015. Approximated Summarization of Data Provenance. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management (Melbourne, Australia) (CIKM '15). Association for Computing Machinery, New York, NY, USA, 483--492.
[2]
Siddhant Arora and Srikanta Bedathur. 2020. On Embeddings in Relational Databases. arXiv:2005.06437 [cs.DB]
[3]
Rajesh Bordawekar, Bortik Bandyopadhyay, and Oded Shmueli. 2017. Cognitive Database: A Step towards Endowing Relational Databases with Artificial Intelligence Capabilities. CoRR abs/1712.07199 (2017). arXiv:1712.07199 http://arxiv.org/abs/1712.07199
[4]
Rajesh Bordawekar and Oded Shmueli. 2021. Creating cognitive intelligence queries from multiple data corpuses. United States Patent 10,984,030.
[5]
Peter Buneman, Sanjeev Khanna, and Wang Chiew Tan. 2001. Why and Where: A Characterization of Data Provenance. In Database Theory - ICDT 2001, 8th International Conference, London, UK, January 4--6, 2001, Proceedings (Lecture Notes in Computer Science, Vol. 1973), Jan Van den Bussche and Victor Vianu (Eds.). Springer, 316--330.
[6]
Riccardo Cappuzzo, Paolo Papotti, and Saravanan Thirumuruganathan. 2020. Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (Portland, OR, USA) (SIGMOD '20). Association for Computing Machinery, New York, NY, USA, 1335--1349.
[7]
James Cheney, Laura Chiticariu, Wang-Chiew Tan, et al. 2009. Provenance in databases: Why, how, and where. Foundations and Trends® in Databases 1, 4 (2009), 379--474.
[8]
Daniel Deutch, Nave Frost, and Amir Gilad. 2017. Provenance for Natural Language Queries. Proc. VLDB Endow. 10, 5 (2017), 577--588.
[9]
Daniel Deutch, Amir Gilad, and Yuval Moskovitch. 2015. Selective provenance for datalog programs using top-k queries. Proceedings of the VLDB Endowment (2015).
[10]
Daniel Deutch, Tova Milo, Sudeepa Roy, and Val Tannen. 2014. Circuits for Datalog Provenance. In ICDT 2014.
[11]
Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. 2013. Optimized product quantization for approximate nearest neighbor search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2946--2953.
[12]
Aristides Gionis, Piotr Indyk, and Rajeev Motwani. 1999. Similarity Search in High Dimensions via Hashing. In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB '99). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 518--529.
[13]
Todd J Green, Grigoris Karvounarakis, and Val Tannen. 2007. Provenance semirings. In Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM, 31--40.
[14]
Michael Günther, Maik Thiele, Julius Gonsior, and Wolfgang Lehner. 2021. Pre-Trained Web Table Embeddings for Table Discovery. In Fourth Workshop in Exploiting AI Techniques for Data Management (Virtual Event, China) (aiDM '21). Association for Computing Machinery, New York, NY, USA, 24--31.
[15]
Zack Ives, Yi Zhang, Soonbo Han, and Nan Zheng. 2019. Dataset Relationship Management. In CIDR 2019, 9th Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 13--16, 2019, Online Proceedings. www.cidrdb.org. http://cidrdb.org/cidr2019/papers/p55-ives-cidr19.pdf
[16]
Alison Kretser, Delia Murphy, and Pamela Starke-Reed. 2017. A partnership for public health: USDA branded food products database. Journal of Food Composition and Analysis 64 (2017), 10--12. The 39th National Nutrient Databank Conference: The Future of Food and Nutrient Databases: Invention, Innovation, and Inspiration.
[17]
Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, and Kilian Q. Weinberger. 2015. From Word Embeddings To Document Distances. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6--11 July 2015 (JMLR Workshop and Conference Proceedings, Vol. 37), Francis R. Bach and David M. Blei (Eds.). JMLR.org, 957--966. http://proceedings.mlr.press/v37/kusnerb15.html
[18]
Seokki Lee, Bertram Ludäscher, and Boris Glavic. 2018. Provenance Summaries for Answers and Non-Answers. Proc. VLDB Endow. 11, 12 (Aug. 2018), 1954--1957.
[19]
Michael Leybovich and Oded Shmueli. 2020. ML Based Provenance in Databases. In AIDB@VLDB 2020, 2nd International Workshop on Applied AI for Database Systems and Applications, Held with VLDB 2020, Monday, August 31, 2020, Online Event / Tokyo, Japan, Bingsheng He, Berthold Reinwald, and Yingjun Wu (Eds.). https://tinyurl.com/LeybovichS20
[20]
Michael Leybovich and Oded Shmueli. 2021. Efficient Approximate Search for Sets of Vectors.
[21]
Michael Leybovich and Oded Shmueli. 2021. ML Based Lineage in Databases.
[22]
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2--4, 2013, Workshop Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). http://arxiv.org/abs/1301.3781
[23]
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP). 1532--1543. http://www.aclweb.org/anthology/D14-1162
[24]
Pierre Senellart, Louis Jachiet, et al. 2018. ProvSQL: Provenance and Probability Management in PostgreSQL. Proc. VLDB Endow. 11, 12 (2018), 2034--2037.

Published In

TaPP '22: Proceedings of the 14th International Workshop on the Theory and Practice of Provenance
June 2022
67 pages
ISBN:9781450393492
DOI:10.1145/3530800
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. database embedding
  2. lineage
  3. provenance

Conference

SIGMOD/PODS '22
Acceptance Rates

TaPP '22 Paper Acceptance Rate 10 of 17 submissions, 59%;
Overall Acceptance Rate 10 of 17 submissions, 59%
