Understanding provenance black boxes

Chapman, Adriane; Jagadish, H. V.

doi:10.1007/s10619-009-7058-3

Published: 16 January 2010

Distributed and Parallel Databases, Volume 27, pages 139–167 (2010)

Adriane Chapman¹ &
H. V. Jagadish²

Abstract

Current provenance stores associated with workflow management systems (WfMSs) capture enough coarse-grained information to describe which datasets were used and which processes were run. While this information is enough to rebuild a workflow run, it is not enough to facilitate user understanding. Because the data is manipulated by a series of black boxes, it is often impossible for a human to understand what happened to it. In this work, we highlight the missing information that can assist user understanding. Unfortunately, provenance information is already complex and difficult for a user to comprehend, and adding the extra information needed for deeper black-box understanding can make this worse. To alleviate this, we develop a model of provenance answers that follow a “roll up”, “drill down” strategy. We evaluate these techniques to determine whether users gain a better understanding of provenance information. We show how this information can be captured by workflow management systems, and that the structures and information needed for this model are a negligible addition to standard provenance stores. Finally, we implement these techniques in a real provenance system and evaluate the feasibility of the implementation.

Author information

Authors and Affiliations

The MITRE Corporation, McLean, VA, USA
Adriane Chapman
University of Michigan, Ann Arbor, MI, USA
H. V. Jagadish

Corresponding author

Correspondence to Adriane Chapman.

Additional information

Communicated by Walid G. Aref and Ouzzani Mourad.

This work was supported in part by NSF grant number IIS 0741620 and by NIH grant number U54 DA021519.

About this article

Cite this article

Chapman, A., Jagadish, H.V. Understanding provenance black boxes. Distrib Parallel Databases 27, 139–167 (2010). https://doi.org/10.1007/s10619-009-7058-3

Published: 16 January 2010
Issue Date: April 2010
DOI: https://doi.org/10.1007/s10619-009-7058-3
