Abstract
The growing capability of data-driven science is increasing the need for applications controlled by data-centric workflows, also known as scientific workflows. This work focuses on provenance collection for these workflows, which is necessary to validate a workflow and to determine the quality of the generated data products. However, instrumenting a workflow engine for provenance collection is burdensome: it requires adding hooks to the engine to capture provenance, which can perturb execution. We address the challenge of extracting provenance data, in the form of a knowledge graph, from workflow event logs in order to record critical information about the applications and the workflows. We present an ontology-based framework for provenance collection using the event logs of a workflow engine. Further, we reduce provenance use cases to SPARQL queries over the captured provenance knowledge graph. Performance evaluation demonstrates that the framework can reconstruct complete data and invocation dependency graphs from one or more execution traces.
This paper is an extended version of [5].
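The idea summarised in the abstract, turning event-log entries into provenance statements and then answering a dependency question over them, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the log format, the `LOG` sample, and the helper names `log_to_triples` and `derived_from` are hypothetical. Only the property and class names `prov:used`, `prov:wasGeneratedBy`, and `prov:Activity` come from the W3C PROV vocabulary. A plain Python traversal stands in for the SPARQL queries the abstract mentions.

```python
import re
from collections import defaultdict

# Hypothetical event-log lines; a real workflow engine's log format will differ.
LOG = """\
2020-01-01T10:00:00 START run=r1 step=clean
2020-01-01T10:00:01 READ run=r1 step=clean data=raw.csv
2020-01-01T10:00:05 WRITE run=r1 step=clean data=clean.csv
2020-01-01T10:00:06 START run=r1 step=model
2020-01-01T10:00:07 READ run=r1 step=model data=clean.csv
2020-01-01T10:00:09 WRITE run=r1 step=model data=result.csv
"""

# Assumed line layout: timestamp, event type, run id, step name, optional data item.
LINE = re.compile(
    r"^(\S+) (START|READ|WRITE) run=(\S+) step=(\S+)(?: data=(\S+))?$"
)


def log_to_triples(log_text):
    """Map each recognised log line to a (subject, predicate, object) triple."""
    triples = set()
    for line in log_text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # skip lines that do not follow the assumed layout
        _, event, run, step, data = match.groups()
        activity = f"{run}/{step}"
        if event == "START":
            triples.add((activity, "rdf:type", "prov:Activity"))
        elif event == "READ" and data:
            triples.add((activity, "prov:used", data))
        elif event == "WRITE" and data:
            triples.add((data, "prov:wasGeneratedBy", activity))
    return triples


def derived_from(triples, target):
    """Return every data item that `target` transitively depends on."""
    generated_by = {s: o for s, p, o in triples if p == "prov:wasGeneratedBy"}
    used = defaultdict(set)
    for s, p, o in triples:
        if p == "prov:used":
            used[s].add(o)

    seen, stack = set(), [target]
    while stack:
        item = stack.pop()
        activity = generated_by.get(item)  # None for raw inputs
        for source in used.get(activity, ()):
            if source not in seen:
                seen.add(source)
                stack.append(source)
    return seen


if __name__ == "__main__":
    graph = log_to_triples(LOG)
    print(sorted(derived_from(graph, "result.csv")))
```

In a setting like the one the abstract describes, the triples would be loaded into an RDF store and the same question asked as a SPARQL query over the provenance knowledge graph. The traversal above only shows the shape of the data dependency being reconstructed.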
References
Altintas, I., Barney, O., Jaeger-Frank, E.: Provenance collection support in the Kepler scientific workflow system. In: Moreau, L., Foster, I. (eds.) Provenance and Annotation of Data. IPAW 2006. Lecture Notes in Computer Science, vol. 4145. Springer, Heidelberg (2006). https://doi.org/10.1007/11890850_1406
Altintas, I., Berkley, C., Jaeger, E., Jones, M., Ludascher, B., Mock, S.: Kepler: an extensible system for design and execution of scientific workflows. In: Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004, pp. 423–424. IEEE (2004)
Bavoil, L., et al.: VisTrails: enabling interactive multiple-view visualizations. In: VIS 05 IEEE Visualization, pp. 135–142 (October 2005). https://doi.org/10.1109/VISUAL.2005.1532788
Belhajjame, K., et al.: Using a suite of ontologies for preserving workflow-centric research objects. J. Web Semant. 32, 16–42 (2015)
Butt, A.S., Car, N., Fitch, P.: Towards ontology driven provenance in scientific workflow engine. In: Proceedings of the 8th International Conference on Model-Driven Engineering and Software Development, MODELSWARD 2020, Valletta, Malta, February 25–27, 2020, pp. 105–115 (2020)
Car, N.J., Stanford, L.S., Sedgmen, A.: Enabling web service request citation by provenance information. In: Provenance and Annotation of Data and Processes - 6th International Provenance and Annotation Workshop, McLean, VA, USA, June 7–8, 2016, Proceedings, pp. 122–133 (2016). https://doi.org/10.1007/978-3-319-40593-3_10
Cohen-Boulakia, S., et al.: Scientific workflows for computational reproducibility in the life sciences: status, challenges and opportunities. Future Gener. Comput. Syst. 75, 284–298 (2017)
Cuevas-Vicenttín, V., et al.: Provone: a prov extension data model for scientific workflow provenance (2015). https://purl.dataone.org/provone-v1-dev. Accessed 12 Dec 2019
Deelman, E., Gannon, D., Shields, M., Taylor, I.: Workflows and e-science: an overview of workflow system features and capabilities. Future Gener. Comput. Syst. 25(5), 528–540 (2009)
Deelman, E., et al.: The future of scientific workflows. Int. J. High Perform. Comput. Appl. 32(1), 159–175 (2018)
Fu, X., Ren, R., Zhan, J., Zhou, W., Jia, Z., Lu, G.: LogMaster: mining event correlations in logs of large-scale cluster systems. In: 2012 IEEE 31st Symposium on Reliable Distributed Systems, pp. 71–80 (October 2012). https://doi.org/10.1109/SRDS.2012.40
Gaaloul, W., Gaaloul, K., Bhiri, S., Haller, A., Hauswirth, M.: Log-based transactional workflow mining. Distrib. Parallel Databases 25(3), 193–240 (2009)
Garijo, D., Gil, Y.: A new approach for publishing workflows: abstractions, standards, and linked data. In: Proceedings of the 6th Workshop on Workflows in Support of Large-scale Science, WORKS 2011, pp. 47–56. ACM, New York (2011). https://doi.org/10.1145/2110497.2110504
Ghoshal, D., Plale, B.: Provenance from log files: a bigdata problem. In: Proceedings of the Joint EDBT/ICDT 2013 Workshops, EDBT 2013, pp. 290–297. ACM, New York (2013). https://doi.org/10.1145/2457317.2457366
Goecks, J., Nekrutenko, A., Taylor, J.: Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 11(8), R86 (2010)
Guedes, T., Silva, V., Mattoso, M., Bedo, M.V., de Oliveira, D.: A practical roadmap for provenance capture and data analysis in spark-based scientific workflows. In: 2018 IEEE/ACM Workflows in Support of Large-Scale Science (WORKS), pp. 31–41. IEEE (2018)
Gunter, D., Tierney, B., Crowley, B., Holding, M., Lee, J.: NetLogger: a toolkit for distributed system performance analysis. In: Proceedings 8th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (Cat. No. PR00728), pp. 267–273. IEEE (2000)
Herschel, M., Diestelkämper, R., Ben Lahmar, H.: A survey on provenance: what for? what form? what from? VLDB J. 26(6), 881–906 (2017)
Hull, D., et al.: Taverna: a tool for building and running workflows of services. Nucleic Acids Res. 34(suppl-2), W729–W732 (2006)
Jiang, W., Hu, C., Pasupathy, S., Kanevsky, A., Li, Z., Zhou, Y.: Understanding customer problem troubleshooting from storage system logs. In: Proceedings of the 7th Conference on File and Storage Technologies, FAST 2009, pp. 43–56. USENIX Association, Berkeley (2009). http://dl.acm.org/citation.cfm?id=1525908.1525912
Kim, J., Deelman, E., Gil, Y., Mehta, G., Ratnakar, V.: Provenance trails in the WINGS/Pegasus system. Concurr. Comput.: Pract. Exp. 20(5), 587–597 (2008)
Liu, J., Pacitti, E., Valduriez, P., Mattoso, M.: A survey of data-intensive scientific workflow management. J. Grid Comput. 13(4), 457–493 (2015)
Moreau, L., Missier, P.: PROV-DM: The PROV Data Model. W3C Recommendation, World Wide Web Consortium (2013). https://www.w3.org/TR/prov-dm/. Accessed 12 Dec 2019
Moreau, L.: Aggregation by provenance types: a technique for summarising provenance graphs. arXiv preprint arXiv:1504.02616 (2015)
Murta, L., Braganholo, V., Chirigati, F., Koop, D., Freire, J.: noWorkflow: capturing and analyzing provenance of scripts. In: Ludäscher, B., Plale, B. (eds.) Provenance and Annotation of Data and Processes. IPAW 2014. Lecture Notes in Computer Science, vol. 8628. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-16462-5_6
Oinn, T., et al.: Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics 20(17), 3045–3054 (2004). https://doi.org/10.1093/bioinformatics/bth361
Oliner, A., Stearley, J.: What supercomputers say: a study of five system logs. In: 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp. 575–584. IEEE (2007)
Oliveira, W., Oliveira, D.D., Braganholo, V.: Provenance analytics for workflow-based computational experiments: a survey. ACM Comput. Surv. (CSUR) 51(3), 53 (2018). https://doi.org/10.1145/3184900
Simmhan, Y.L., Plale, B., Gannon, D.: A framework for collecting provenance in data-centric scientific workflows. In: 2006 IEEE International Conference on Web Services (ICWS 2006), pp. 427–436. IEEE (2006)
Taylor, I., Shields, M., Wang, I., Harrison, A.: The Triana workflow environment: architecture and applications. In: Taylor, I.J., Deelman, E., Gannon, D.B., Shields, M. (eds.) Workflows for e-Science. Springer, London (2007). https://doi.org/10.1007/978-1-84628-757-2_20
Van Der Aalst, W.M., Ter Hofstede, A.H.: YAWL: yet another workflow language. Inf. Syst. 30(4), 245–275 (2005)
Xu, W., Huang, L., Fox, A., Patterson, D., Jordan, M.I.: Detecting large-scale system problems by mining console logs. In: Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP 2009, pp. 117–132. ACM, New York (2009). https://doi.org/10.1145/1629575.1629587
Yuan, D., Mai, H., Xiong, W., Tan, L., Zhou, Y., Pasupathy, S.: SherLog: error diagnosis by connecting clues from run-time logs. SIGPLAN Not. 45(3), 143–154 (2010). https://doi.org/10.1145/1735971.1736038
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Butt, A.S., Fitch, P. (2021). ProvAnalyser: A Framework for Scientific Workflows Provenance. In: Hammoudi, S., Pires, L.F., Selić, B. (eds) Model-Driven Engineering and Software Development. MODELSWARD 2020. Communications in Computer and Information Science, vol 1361. Springer, Cham. https://doi.org/10.1007/978-3-030-67445-8_5
DOI: https://doi.org/10.1007/978-3-030-67445-8_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-67444-1
Online ISBN: 978-3-030-67445-8
eBook Packages: Computer Science, Computer Science (R0)