P-PIF: a ProvONE provenance interoperability framework for analyzing heterogeneous workflow specifications and provenance traces

Distributed and Parallel Databases

Abstract

Enabling provenance interoperability by analyzing heterogeneous provenance information from different scientific workflow management systems (WfMSs) is a novel research topic. With the advent of the ProvONE model, it is now possible to represent both prospective and retrospective provenance in a single provenance model. Scientific workflows are composed using a declarative definition language, such as BPEL, SCUFL/t2flow, or MoML. Associated with the execution of a workflow is its corresponding provenance, which is modeled and stored in the data model specified by the workflow system. However, sharing provenance generated by heterogeneous workflows is challenging, which prevents the aggregate analysis and comparison of workflows and their associated provenance. To address these challenges, this paper introduces a ProvONE-based Provenance Interoperability Framework (P-PIF) that completely automates the modeling of provenance from heterogeneous WfMSs by: (a) automatically translating the scientific workflows to their equivalent representation in a ProvONE prospective graph using the Prov2ONE algorithm, (b) enriching the ProvONE prospective graph with the retrospective provenance exported by the WfMSs, and (c) natively storing the ProvONE provenance graphs in a Resource Description Framework (RDF) triplestore that supports the SPARQL query language for querying and retrieving ProvONE graphs. The Prov2ONE algorithm is based on a set of vocabulary translation rules between workflow specifications and the ProvONE model. The correctness and completeness of the algorithm are proved, and its complexity is analyzed. Moreover, to demonstrate the practical applicability of the complete framework, ProvONE graphs are generated for workflows defined in BPEL, SCUFL, and MoML. Finally, the provenance challenge queries are extended with six additional queries for retrieving the provenance modeled in ProvONE.
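The three steps named in the abstract (translation to a prospective graph, enrichment with retrospective provenance, and querying) can be illustrated with a minimal sketch. It is not the Prov2ONE algorithm: the dictionary-based workflow format, the translation rules, the node names (`align`, `reslice`, `exec1`, and so on), and the `ex:ranProgram` shortcut predicate are all assumptions made for illustration, and a plain list of triples with a pattern-matching helper stands in for an RDF triplestore queried with SPARQL. Only the `provone:` and `prov:` terms are taken from the published vocabularies.

```python
# Illustrative sketch only: input format and rules are assumptions,
# not the vocabulary translation rules defined in the paper.

def translate_prospective(workflow):
    """Map a toy workflow description to ProvONE-style prospective triples."""
    wf = workflow["name"]
    triples = [(wf, "rdf:type", "provone:Workflow")]
    for step in workflow["steps"]:
        triples += [(step, "rdf:type", "provone:Program"),
                    (wf, "provone:hasSubProgram", step)]
    for source, target in workflow["links"]:
        # Hypothetical naming scheme for ports and channels.
        out_port = f"{source}_out"
        in_port = f"{target}_in"
        channel = f"{source}_to_{target}"
        triples += [(out_port, "rdf:type", "provone:Port"),
                    (in_port, "rdf:type", "provone:Port"),
                    (channel, "rdf:type", "provone:Channel"),
                    (source, "provone:hasOutPort", out_port),
                    (target, "provone:hasInPort", in_port),
                    (out_port, "provone:connectsTo", channel),
                    (in_port, "provone:connectsTo", channel)]
    return list(dict.fromkeys(triples))  # drop duplicates, keep order


def enrich_retrospective(triples, trace):
    """Add execution records (retrospective provenance) to the graph."""
    enriched = list(triples)
    for event in trace:
        run = event["execution"]
        enriched += [(run, "rdf:type", "provone:Execution"),
                     # "ex:ranProgram" is a made-up shortcut linking an
                     # execution to its program; it is not a ProvONE term.
                     (run, "ex:ranProgram", event["program"])]
        for item in event.get("used", []):
            enriched += [(item, "rdf:type", "provone:Data"),
                         (run, "prov:used", item)]
        for item in event.get("generated", []):
            enriched += [(item, "rdf:type", "provone:Data"),
                         (item, "prov:wasGeneratedBy", run)]
    return list(dict.fromkeys(enriched))


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


workflow = {"name": "wf",
            "steps": ["align", "reslice"],
            "links": [("align", "reslice")]}
trace = [{"execution": "exec1", "program": "align",
          "used": ["img1"], "generated": ["warp1"]}]

graph = enrich_retrospective(translate_prospective(workflow), trace)
programs = [s for s, _, _ in match(graph, p="rdf:type", o="provone:Program")]
producers = [o for _, _, o in match(graph, s="warp1", p="prov:wasGeneratedBy")]
print(programs, producers)
```

Keeping the prospective and retrospective triples in one graph is what lets a single query span both, for example asking which program in the workflow specification produced a given data item. In an actual deployment the same triples would be loaded into a triplestore and the `match` helper replaced by SPARQL queries.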

Notes

  1. https://www.myexperiment.org/.

  2. http://www.sfb-episteme.de/teilprojekte/informationsinfrastruktur/index.html.

  3. http://www.dfg.de/foerderung/info_wissenschaft/2017/info_wissenschaft_17_13/.

  4. https://github.com/taverna/taverna-prov.

  5. https://github.com/mbaloch/provone-pif.

  6. http://ode.apache.org/management-api.html.

  7. https://github.com/taverna/taverna-prov.

  8. https://code.kepler-project.org/code/kepler/trunk/modules/provenance/docs/provenance.pdf.

  9. http://ode.apache.org/ode-execution-events.html.

  10. http://ode.apache.org/ode-execution-events.html.

  11. http://www.myexperiment.org/workflows/4736.html.

  12. http://ptolemy.eecs.berkeley.edu/ptolemyII/ptII10.0/ptII10.0.1/ptolemy/domains/sdf/demo/FixPoint/FixPoint/.

  13. http://datamanager.kit.edu/masi/localizationmicroscopy/swagger-ui/.

Acknowledgements

This research is supported by the Portfolio Extension of Helmholtz Association "Large Scale Data Management and Analysis" and DFG (German Research Foundation) MASi project (STO 397/4-1). We are thankful to Kay-Michael Wuerzner and the OCR-D team for contributing their use case and volunteering as pilot adopters of P-PIF.

Author information

Corresponding author

Correspondence to Ajinkya Prabhune.

About this article

Cite this article

Prabhune, A., Zweig, A., Stotzka, R. et al. P-PIF: a ProvONE provenance interoperability framework for analyzing heterogeneous workflow specifications and provenance traces. Distrib Parallel Databases 36, 219–264 (2018). https://doi.org/10.1007/s10619-017-7216-y

  • DOI: https://doi.org/10.1007/s10619-017-7216-y