Discovering Similar Workflows via Provenance Clustering: A Case Study

Alawini, Abdussalam; Chen, Leshang; Davidson, Susan; Fisher, Stephen; Kim, Junhyong

doi:10.1007/978-3-319-98379-0_9

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11017)

Included in the following conference series:

International Provenance and Annotation Workshop

Abstract

Several workflow management systems and scripting languages have adopted provenance tracking, yet many researchers still capture provenance manually or instrument their processing scripts to write provenance information to files. The Next Generation Sequencing (NGS) project we are associated with tracks provenance in this manner. The NGS project is a collaboration between multiple groups at different sites, where each group collects and processes samples using an agreed-upon workflow. The workflow contains many stages of varying complexity. Over time, workflow stages are modified, but data samples are comparable only when processed with identical versions of the workflow. However, for various reasons (including the distributed nature of the collaboration), it is not always clear which samples have been processed with which version of the workflow. In this paper, we introduce new techniques for clustering provenance datasets in order to discover those that are likely to have been generated by the same workflow. Based on the clustering result, users can identify similar provenance and group it into clusters for debugging and zoom-in/zoom-out viewing.

J. Kim—Evolutionary and Molecular Biology (Kim) Lab, University of Pennsylvania.

Notes

1. One-hot encoding is a process in which categorical data is converted into a bit vector.
2. An induced subgraph is a subset of a graph's nodes together with all edges of the original graph that connect those nodes.
3. Linking two activities with a common entity is now supported by PROV Constraint 33.
4. Our dataset is available for download at https://github.com/alawinia/provClustering.
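
Notes 1 and 2 can be illustrated with a minimal sketch. This is not the authors' implementation; the function names `one_hot` and `induced_subgraph` and the stage labels (`trim`, `align`, `count`) are hypothetical and chosen only for illustration.

```python
from typing import Hashable, Iterable, List, Optional, Sequence, Tuple

Edge = Tuple[Hashable, Hashable]


def one_hot(values: Sequence[str],
            categories: Optional[Sequence[str]] = None) -> List[List[int]]:
    """Encode each categorical value as a bit vector containing a single 1.

    If no category order is given, the sorted distinct values are used.
    """
    if categories is None:
        categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        bits = [0] * len(categories)
        bits[index[value]] = 1  # set the position that identifies this category
        vectors.append(bits)
    return vectors


def induced_subgraph(edges: Iterable[Edge],
                     nodes: Iterable[Hashable]) -> List[Edge]:
    """Return the edges of the original graph whose endpoints both lie in `nodes`."""
    keep = set(nodes)
    return [(u, v) for (u, v) in edges if u in keep and v in keep]


# Hypothetical workflow stage labels (assumed for illustration only).
stage_vectors = one_hot(["align", "trim", "align"])

# Hypothetical provenance-like graph given as a list of directed edges.
graph_edges = [("trim", "align"), ("align", "count"), ("trim", "count")]
sub_edges = induced_subgraph(graph_edges, {"trim", "align"})
```

Here `stage_vectors` holds one bit vector per input label, and `sub_edges` keeps only the edge between the two selected nodes; edges touching `count` are dropped because `count` is outside the chosen node subset.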

Author information

Authors and Affiliations

University of Pennsylvania, Philadelphia, USA
Abdussalam Alawini, Leshang Chen, Susan Davidson, Stephen Fisher & Junhyong Kim

Corresponding author

Correspondence to Abdussalam Alawini.

Editor information

Editors and Affiliations

Paris Dauphine University, Paris, France
Khalid Belhajjame
SRI International, Menlo Park, CA, USA
Ashish Gehani
University of Luxembourg, Belvaux, Luxembourg
Pinar Alper

About this paper

Cite this paper

Alawini, A., Chen, L., Davidson, S., Fisher, S., Kim, J. (2018). Discovering Similar Workflows via Provenance Clustering: A Case Study. In: Belhajjame, K., Gehani, A., Alper, P. (eds) Provenance and Annotation of Data and Processes. IPAW 2018. Lecture Notes in Computer Science, vol 11017. Springer, Cham. https://doi.org/10.1007/978-3-319-98379-0_9

DOI: https://doi.org/10.1007/978-3-319-98379-0_9
Published: 06 September 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-98378-3
Online ISBN: 978-3-319-98379-0
eBook Packages: Computer Science, Computer Science (R0)