ProvAnalyser: A Framework for Scientific Workflows Provenance

  • Conference paper
Model-Driven Engineering and Software Development (MODELSWARD 2020)

Abstract

The growing capability of data-driven science is increasing the need for applications controlled by data-centric workflows, also known as scientific workflows. This work focuses on provenance collection for these workflows, which is necessary to validate a workflow and to determine the quality of the data products it generates. However, instrumenting a workflow engine for provenance collection is burdensome: it requires adding hooks to the engine to capture provenance, which can perturb execution. We address the challenge of extracting provenance, in the form of a knowledge graph, from workflow event logs in order to record critical information about the applications and the workflows. We present an ontology-based framework for provenance collection that uses the event logs of the workflow engine. Further, we reduce provenance use cases to SPARQL queries over the captured provenance knowledge graph. A performance evaluation demonstrates that the framework can reconstruct complete data and invocation dependency graphs from one or more execution traces.

This paper is an extended version of [5].
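
As a rough illustration of the idea summarised in the abstract (event logs in, a provenance graph out, lineage questions answered over that graph), the sketch below uses only the Python standard library. The log format, the identifiers (`exec1`, `modelA`, `raw.csv`) and the helper names are hypothetical and are not taken from the paper or from any workflow engine; only the predicate names `prov:used` and `prov:wasGeneratedBy` come from W3C PROV. The plain graph traversal stands in for the SPARQL queries the framework issues against an RDF store.

```python
import re
from collections import defaultdict

# Hypothetical event-log format (an assumption for illustration only):
#   <timestamp> <execution-id> <READ|WRITE> <model> <data-item>
LOG = """\
2020-01-01T10:00:00 exec1 READ  modelA raw.csv
2020-01-01T10:00:05 exec1 WRITE modelA clean.csv
2020-01-01T10:01:00 exec2 READ  modelB clean.csv
2020-01-01T10:01:30 exec2 WRITE modelB forecast.nc
"""

LINE = re.compile(r"(\S+)\s+(\S+)\s+(READ|WRITE)\s+(\S+)\s+(\S+)")


def log_to_triples(log_text):
    """Turn each data event in the log into a PROV-style triple."""
    triples = set()
    for line in log_text.splitlines():
        match = LINE.fullmatch(line.strip())
        if match is None:
            continue  # skip lines that do not describe a data event
        _timestamp, execution, event, _model, item = match.groups()
        if event == "READ":
            triples.add((execution, "prov:used", item))
        else:
            triples.add((item, "prov:wasGeneratedBy", execution))
    return triples


def derive_data_dependencies(triples):
    """Map each generated item to the items its generating execution used."""
    used = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "prov:used":
            used[subject].add(obj)
    deps = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "prov:wasGeneratedBy":
            deps[subject] |= used[obj]
    return deps


def lineage(item, deps):
    """Return every upstream item that `item` transitively depends on."""
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for parent in deps.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    dependencies = derive_data_dependencies(log_to_triples(LOG))
    print(sorted(lineage("forecast.nc", dependencies)))
```

Deriving the dependency edges from `used` and `wasGeneratedBy` pairs, rather than logging them directly, mirrors the point made in the abstract: the engine itself is not instrumented, and the graph is reconstructed after the fact from what the log already records.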


Notes

  1. https://github.com/CSIRO-enviro-informatics/ProvAnalyser

  2. https://www.w3.org/RDF/

  3. http://docs.openlinksw.com/virtuoso/

  4. RESTful API code is available at https://github.com/anilabutt/weprov

  5. https://research.csiro.au/dss/research/senaps/

  6. https://en.wikipedia.org/wiki/Directed_acyclic_graph

References

  1. Altintas, I., Barney, O., Jaeger-Frank, E.: Provenance collection support in the Kepler scientific workflow system. In: Moreau, L., Foster, I. (eds.) Provenance and Annotation of Data. IPAW 2006. Lecture Notes in Computer Science, vol. 4145. Springer, Heidelberg (2006). https://doi.org/10.1007/11890850_1406

  2. Altintas, I., Berkley, C., Jaeger, E., Jones, M., Ludascher, B., Mock, S.: Kepler: an extensible system for design and execution of scientific workflows. In: Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004, pp. 423–424. IEEE (2004)

  3. Bavoil, L., et al.: VisTrails: enabling interactive multiple-view visualizations. In: VIS 05 IEEE Visualization, pp. 135–142 (October 2005). https://doi.org/10.1109/VISUAL.2005.1532788

  4. Belhajjame, K., et al.: Using a suite of ontologies for preserving workflow-centric research objects. J. Web Semant. 32, 16–42 (2015)

  5. Butt, A.S., Car, N., Fitch, P.: Towards ontology driven provenance in scientific workflow engine. In: Proceedings of the 8th International Conference on Model-Driven Engineering and Software Development, MODELSWARD 2020, Valletta, Malta, February 25–27, 2020, pp. 105–115 (2020)

  6. Car, N.J., Stanford, L.S., Sedgmen, A.: Enabling web service request citation by provenance information. In: Provenance and Annotation of Data and Processes - 6th International Provenance and Annotation Workshop, McLean, VA, USA, June 7–8, 2016, Proceedings, pp. 122–133 (2016). https://doi.org/10.1007/978-3-319-40593-3_10

  7. Cohen-Boulakia, S., et al.: Scientific workflows for computational reproducibility in the life sciences: status, challenges and opportunities. Future Gener. Comput. Syst. 75, 284–298 (2017)

  8. Cuevas-Vicenttín, V., et al.: ProvONE: a PROV extension data model for scientific workflow provenance (2015). https://purl.dataone.org/provone-v1-dev. Accessed 12 Dec 2019

  9. Deelman, E., Gannon, D., Shields, M., Taylor, I.: Workflows and e-science: an overview of workflow system features and capabilities. Future Gener. Comput. Syst. 25(5), 528–540 (2009)

  10. Deelman, E., et al.: The future of scientific workflows. Int. J. High Perform. Comput. Appl. 32(1), 159–175 (2018)

  11. Fu, X., Ren, R., Zhan, J., Zhou, W., Jia, Z., Lu, G.: LogMaster: mining event correlations in logs of large-scale cluster systems. In: 2012 IEEE 31st Symposium on Reliable Distributed Systems, pp. 71–80 (October 2012). https://doi.org/10.1109/SRDS.2012.40

  12. Gaaloul, W., Gaaloul, K., Bhiri, S., Haller, A., Hauswirth, M.: Log-based transactional workflow mining. Distrib. Parallel Databases 25(3), 193–240 (2009)

  13. Garijo, D., Gil, Y.: A new approach for publishing workflows: abstractions, standards, and linked data. In: Proceedings of the 6th Workshop on Workflows in Support of Large-scale Science, WORKS 2011, pp. 47–56. ACM, New York (2011). https://doi.org/10.1145/2110497.2110504

  14. Ghoshal, D., Plale, B.: Provenance from log files: a bigdata problem. In: Proceedings of the Joint EDBT/ICDT 2013 Workshops, EDBT 2013, pp. 290–297. ACM, New York (2013). https://doi.org/10.1145/2457317.2457366

  15. Goecks, J., Nekrutenko, A., Taylor, J.: Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 11(8), R86 (2010)

  16. Guedes, T., Silva, V., Mattoso, M., Bedo, M.V., de Oliveira, D.: A practical roadmap for provenance capture and data analysis in Spark-based scientific workflows. In: 2018 IEEE/ACM Workflows in Support of Large-Scale Science (WORKS), pp. 31–41. IEEE (2018)

  17. Gunter, D., Tierney, B., Crowley, B., Holding, M., Lee, J.: NetLogger: a toolkit for distributed system performance analysis. In: Proceedings 8th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (Cat. No. PR00728), pp. 267–273. IEEE (2000)

  18. Herschel, M., Diestelkämper, R., Ben Lahmar, H.: A survey on provenance: what for? what form? what from? VLDB J.-Int. J. Very Large Data Bases 26(6), 881–906 (2017)

  19. Hull, D., et al.: Taverna: a tool for building and running workflows of services. Nucleic Acids Res. 34(suppl-2), W729–W732 (2006)

  20. Jiang, W., Hu, C., Pasupathy, S., Kanevsky, A., Li, Z., Zhou, Y.: Understanding customer problem troubleshooting from storage system logs. In: Proceedings of the 7th Conference on File and Storage Technologies, FAST 2009, pp. 43–56. USENIX Association, Berkeley (2009). http://dl.acm.org/citation.cfm?id=1525908.1525912

  21. Kim, J., Deelman, E., Gil, Y., Mehta, G., Ratnakar, V.: Provenance trails in the WINGS/Pegasus system. Concurr. Comput.: Pract. Exp. 20(5), 587–597 (2008)

  22. Liu, J., Pacitti, E., Valduriez, P., Mattoso, M.: A survey of data-intensive scientific workflow management. J. Grid Comput. 13(4), 457–493 (2015)

  23. Moreau, L., Missier, P.: World Wide Web Consortium “PROV-DM: The PROV Data Model” W3C Recommendation (2013). https://www.w3.org/TR/prov-dm/. Accessed 12 Dec 2019

  24. Moreau, L.: Aggregation by provenance types: a technique for summarising provenance graphs. arXiv preprint arXiv:1504.02616 (2015)

  25. Murta, L., Braganholo, V., Chirigati, F., Koop, D., Freire, J.: noWorkflow: capturing and analyzing provenance of scripts. In: Ludäscher, B., Plale, B. (eds.) Provenance and Annotation of Data and Processes. IPAW 2014. Lecture Notes in Computer Science, vol. 8628. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-16462-5_6

  26. Oinn, T., et al.: Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics 20(17), 3045–3054 (2004). https://doi.org/10.1093/bioinformatics/bth361

  27. Oliner, A., Stearley, J.: What supercomputers say: a study of five system logs. In: 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp. 575–584. IEEE (2007)

  28. Oliveira, W., Oliveira, D.D., Braganholo, V.: Provenance analytics for workflow-based computational experiments: a survey. ACM Comput. Surv. (CSUR) 51(3), 53 (2018). https://doi.org/10.1145/3184900

  29. Simmhan, Y.L., Plale, B., Gannon, D.: A framework for collecting provenance in data-centric scientific workflows. In: 2006 IEEE International Conference on Web Services (ICWS 2006), pp. 427–436. IEEE (2006)

  30. Taylor, I., Shields, M., Wang, I., Harrison, A.: The Triana workflow environment: architecture and applications. In: Taylor, I.J., Deelman, E., Gannon, D.B., Shields, M. (eds.) Workflows for e-Science. Springer, London (2007). https://doi.org/10.1007/978-1-84628-757-2_20

  31. Van Der Aalst, W.M., Ter Hofstede, A.H.: YAWL: yet another workflow language. Inf. Syst. 30(4), 245–275 (2005)

  32. Xu, W., Huang, L., Fox, A., Patterson, D., Jordan, M.I.: Detecting large-scale system problems by mining console logs. In: Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP 2009, pp. 117–132. ACM, New York (2009). https://doi.org/10.1145/1629575.1629587

  33. Yuan, D., Mai, H., Xiong, W., Tan, L., Zhou, Y., Pasupathy, S.: SherLog: error diagnosis by connecting clues from run-time logs. SIGPLAN Not. 45(3), 143–154 (2010). https://doi.org/10.1145/1735971.1736038


Corresponding author

Correspondence to Anila Sahar Butt.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Butt, A.S., Fitch, P. (2021). ProvAnalyser: A Framework for Scientific Workflows Provenance. In: Hammoudi, S., Pires, L.F., Selić, B. (eds) Model-Driven Engineering and Software Development. MODELSWARD 2020. Communications in Computer and Information Science, vol 1361. Springer, Cham. https://doi.org/10.1007/978-3-030-67445-8_5

  • DOI: https://doi.org/10.1007/978-3-030-67445-8_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-67444-1

  • Online ISBN: 978-3-030-67445-8

  • eBook Packages: Computer Science, Computer Science (R0)
