Project Histories: Managing Data Provenance Across Collection-Oriented Scientific Workflow Runs

  • Conference paper
Data Integration in the Life Sciences (DILS 2007)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 4544)

Abstract

While a number of scientific workflow systems support data provenance, they primarily focus on collecting and querying provenance for single workflow runs. Scientific research projects, however, typically involve (1) many interrelated workflows (where data from one or more workflow runs are selected and used as input to subsequent runs) and (2) tasks between workflow runs that cannot be fully automated. This paper addresses the need for recording data dependencies across multiple workflow runs and accommodating data management activities performed between runs. We define a new conceptual model for representing project-level provenance based on the notion of project histories and folders, and describe mechanisms to support this model in the collection-oriented modeling and design framework of Kepler. Our approach allows users to conveniently organize their projects and data using the familiar folder-hierarchy metaphor, while at the same time integrating this information with detailed provenance of data products generated via automated scientific workflows.

This work was supported in part by NSF grants DBI-053368, EAR-0225673, IIS-0630033, IIS-0612326, and EF-0228651; and DOE grant DE-FC02-01ER25486.

Editor information

Sarah Cohen-Boulakia, Val Tannen

Copyright information

© 2007 Springer Berlin Heidelberg

About this paper

Cite this paper

Bowers, S., McPhillips, T., Wu, M., Ludäscher, B. (2007). Project Histories: Managing Data Provenance Across Collection-Oriented Scientific Workflow Runs. In: Cohen-Boulakia, S., Tannen, V. (eds) Data Integration in the Life Sciences. DILS 2007. Lecture Notes in Computer Science, vol 4544. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73255-6_12

  • DOI: https://doi.org/10.1007/978-3-540-73255-6_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-73254-9

  • Online ISBN: 978-3-540-73255-6

  • eBook Packages: Computer Science, Computer Science (R0)
