Abstract
Several workflow management systems and scripting languages have adopted provenance tracking, yet many researchers still capture provenance manually or instrument their processing scripts to write provenance information to files. The Next Generation Sequencing (NGS) project we are associated with tracks provenance in this manner. The NGS project is a collaboration between multiple groups at different sites, where each group collects and processes samples using an agreed-upon workflow. The workflow contains many stages of varying complexity. Over time, workflow stages are modified, but data samples are only comparable when processed with identical versions of the workflow. However, for various reasons (including the distributed nature of the collaboration), it is not always clear which samples have been processed with which version of the workflow. In this paper, we introduce new techniques for clustering provenance datasets and attempt to discover those that are likely to have been generated by the same workflow. Based on the clustering result, users can identify similar provenance and categorize it into different clusters for debugging and zoom-in/zoom-out viewing.
J. Kim—Evolutionary and Molecular Biology (Kim) Lab, University of Pennsylvania.
Notes
1. One-hot encoding is a process in which categorical data is converted into a bit vector.
2. An induced subgraph is a subset of nodes along with the edges connecting them in the original graph.
3. Linking two activities with a common entity is now supported by PROV Constraint 33.
4. Our dataset is available for download at https://github.com/alawinia/provClustering.
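The first two notes define one-hot encoding and induced subgraphs. As a minimal sketch of those two definitions only (not the authors' implementation), the Python below uses hypothetical helper names `one_hot` and `induced_subgraph`, and the activity labels and edges are made up for illustration.

```python
def one_hot(value, categories):
    """Return a bit vector with a 1 at the position of `value` in `categories`."""
    return [1 if c == value else 0 for c in categories]


def induced_subgraph(edges, nodes):
    """Keep only the edges whose two endpoints both belong to `nodes`."""
    keep = set(nodes)
    return [(u, v) for (u, v) in edges if u in keep and v in keep]


# Hypothetical activity labels and edges, for illustration only.
labels = ["align", "trim", "count"]
edges = [("trim", "align"), ("align", "count"), ("trim", "count")]

print(one_hot("trim", labels))
print(induced_subgraph(edges, ["trim", "align"]))
```

Here the category `trim` maps to the vector `[0, 1, 0]`, and the subgraph induced by `trim` and `align` keeps only the single edge between them.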
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Alawini, A., Chen, L., Davidson, S., Fisher, S., Kim, J. (2018). Discovering Similar Workflows via Provenance Clustering: A Case Study. In: Belhajjame, K., Gehani, A., Alper, P. (eds) Provenance and Annotation of Data and Processes. IPAW 2018. Lecture Notes in Computer Science, vol 11017. Springer, Cham. https://doi.org/10.1007/978-3-319-98379-0_9
DOI: https://doi.org/10.1007/978-3-319-98379-0_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-98378-3
Online ISBN: 978-3-319-98379-0
eBook Packages: Computer Science (R0)