Abstract
Enabling provenance interoperability by analyzing heterogeneous provenance information from different scientific workflow management systems (WfMSs) is a novel research topic. With the advent of the ProvONE model, it is now possible to represent both prospective and retrospective provenance in a single provenance model. Scientific workflows are composed using a declarative definition language, such as BPEL, SCUFL/t2flow, or MoML. Associated with the execution of a workflow is its corresponding provenance, which is modeled and stored in the data model specified by the workflow system. However, sharing provenance generated by heterogeneous workflows is challenging, and this difficulty prevents the aggregate analysis and comparison of workflows and their associated provenance. To address these challenges, this paper introduces a ProvONE-based Provenance Interoperability Framework that completely automates the modeling of provenance from heterogeneous WfMSs by: (a) automatically translating the scientific workflows to their equivalent representation in a ProvONE prospective graph using the Prov2ONE algorithm, (b) enriching the ProvONE prospective graph with the retrospective provenance exported by the WfMSs, and (c) natively storing the ProvONE provenance graphs in a Resource Description Framework (RDF) triplestore that supports the SPARQL query language for querying and retrieving ProvONE graphs. The Prov2ONE algorithm is based on a set of vocabulary translation rules between workflow specifications and the ProvONE model. The correctness and completeness of the algorithm are proven, and its complexity is analyzed. Moreover, to demonstrate the practical applicability of the complete framework, ProvONE graphs are generated for workflows defined in BPEL, SCUFL, and MoML. Finally, the provenance challenge queries are extended with six additional queries for retrieving the provenance modeled in ProvONE.
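The three steps named in the abstract (translate, enrich, store and query) can be illustrated with a toy sketch. It is not the Prov2ONE algorithm: the rule table, the nested-dictionary workflow format, the `ex:ranProgram` shortcut predicate, and all function names are assumptions made for illustration, and a plain Python set of triples stands in for an RDF triplestore. Only the `provone:Workflow`, `provone:Program`, `provone:Execution`, and `provone:hasSubProgram` terms are taken from the ProvONE vocabulary.

```python
# Toy sketch only: the rule table and workflow format are hypothetical,
# not the paper's Prov2ONE translation rules.
RDF_TYPE = "rdf:type"

# Hypothetical vocabulary rules: workflow-language element -> ProvONE class.
RULES = {
    "process": "provone:Workflow",  # e.g. a BPEL <process>
    "invoke": "provone:Program",    # e.g. a BPEL <invoke> activity
}


def to_prospective(workflow):
    """Translate a nested-dict workflow into (subject, predicate, object) triples."""
    triples = set()

    def visit(node, parent=None):
        triples.add((node["id"], RDF_TYPE, RULES[node["kind"]]))
        if parent is not None:
            triples.add((parent, "provone:hasSubProgram", node["id"]))
        for child in node.get("children", []):
            visit(child, node["id"])

    visit(workflow)
    return triples


def enrich(triples, executions):
    """Add retrospective facts linking each execution to the program it ran.

    ex:ranProgram is a made-up shortcut predicate, not a ProvONE term.
    """
    enriched = set(triples)
    for exec_id, program_id in executions:
        enriched.add((exec_id, RDF_TYPE, "provone:Execution"))
        enriched.add((exec_id, "ex:ranProgram", program_id))
    return enriched


def match(triples, s=None, p=None, o=None):
    """Return sorted triples matching a pattern; None is a wildcard,
    playing the role of a variable in a SPARQL triple pattern."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


workflow = {
    "id": "wf1",
    "kind": "process",
    "children": [
        {"id": "step1", "kind": "invoke"},
        {"id": "step2", "kind": "invoke"},
    ],
}
graph = enrich(to_prospective(workflow), [("run1", "step1")])
```

Calling `match(graph, p="provone:hasSubProgram")` lists the workflow structure (prospective provenance), while `match(graph, s="run1")` lists what is recorded about one run (retrospective provenance). In a real deployment the triples would be loaded into an RDF store and queried with SPARQL rather than with this helper.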
Acknowledgements
This research is supported by the Portfolio Extension of Helmholtz Association "Large Scale Data Management and Analysis" and DFG (German Research Foundation) MASi project (STO 397/4-1). We are thankful to Kay-Michael Wuerzner and the OCR-D team for contributing their use case and volunteering as pilot adopters of P-PIF.
Cite this article
Prabhune, A., Zweig, A., Stotzka, R. et al. P-PIF: a ProvONE provenance interoperability framework for analyzing heterogeneous workflow specifications and provenance traces. Distrib Parallel Databases 36, 219–264 (2018). https://doi.org/10.1007/s10619-017-7216-y
DOI: https://doi.org/10.1007/s10619-017-7216-y