ABSTRACT
While provenance has been studied extensively in the literature, the efficient evaluation of provenance queries remains an open problem. Traditional query optimization techniques, such as general-purpose indexes or the materialization of provenance data, each fall short of addressing the problem in different ways. The need for provenance-aware access methods therefore becomes apparent. This paper starts by identifying key requirements that are largely specific to provenance queries and are necessary for their efficient evaluation. The first property, called duality, requires that a single access method be used to evaluate both backward provenance queries (which input items of an analysis generated a given output item) and forward provenance queries (which output items of an analysis a given input item generated). The second property, called locality, requires that provenance query evaluation times depend mainly on the size of the query result and be largely independent of the total size of the provenance data. Motivated by these requirements, we identify data structures that have both properties, implement them, and show through a detailed set of experiments that they are effective for evaluating provenance queries.
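The duality and locality properties described in the abstract can be illustrated with a minimal Python sketch. It is not the data structure proposed in the paper: the class name `ProvenanceIndex`, its methods, and the item labels are hypothetical, and a pair of in-memory hash maps stands in for whatever access method the authors actually use. One index object answers both query directions (duality), and each lookup touches only the entries belonging to its answer rather than scanning all stored pairs (locality).

```python
from collections import defaultdict


class ProvenanceIndex:
    """Toy index over (input item, output item) provenance pairs.

    Hypothetical illustration only; not the structure from the paper.
    """

    def __init__(self):
        # input item -> set of output items it contributed to
        self._outputs_of = defaultdict(set)
        # output item -> set of input items that contributed to it
        self._inputs_of = defaultdict(set)

    def add(self, input_item, output_item):
        """Record that input_item contributed to output_item."""
        self._outputs_of[input_item].add(output_item)
        self._inputs_of[output_item].add(input_item)

    def backward(self, output_item):
        """Backward query: which inputs generated this output?"""
        return sorted(self._inputs_of.get(output_item, ()))

    def forward(self, input_item):
        """Forward query: which outputs did this input generate?"""
        return sorted(self._outputs_of.get(input_item, ()))


index = ProvenanceIndex()
for src, dst in [("in1", "out1"), ("in2", "out1"), ("in2", "out2")]:
    index.add(src, dst)

print(index.backward("out1"))  # ['in1', 'in2']
print(index.forward("in2"))    # ['out1', 'out2']
```

The sketch stores every pair twice, once per direction, so it buys duality at the cost of doubled space; it does not attempt the space efficiency that a purpose-built provenance access method would aim for.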
Index Terms
- Provenance query evaluation: what's so special about it?