Discovering Similar Workflows via Provenance Clustering: A Case Study

  • Conference paper
Provenance and Annotation of Data and Processes (IPAW 2018)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11017)

Abstract

Several workflow management systems and scripting languages have adopted provenance tracking, yet many researchers still capture provenance manually or instrument their processing scripts to write provenance information to files. The Next Generation Sequencing (NGS) project we are associated with tracks provenance in this manner. The NGS project is a collaboration between multiple groups at different sites, where each group collects and processes samples using an agreed-upon workflow. The workflow contains many stages of varying complexity. Over time, workflow stages are modified, but data samples are only comparable when processed with identical versions of the workflow. However, for various reasons (including the distributed nature of the collaboration), it is not always clear which samples have been processed with which version of the workflow. In this paper, we introduce new techniques for clustering provenance datasets in order to discover those that are likely to have been generated by the same workflow. Based on the clustering result, users can identify similar provenance datasets and categorize them into clusters for debugging and zoom-in/zoom-out viewing.

J. Kim—Evolutionary and Molecular Biology (Kim) Lab, University of Pennsylvania.

Notes

  1. One-hot encoding is a process in which categorical data is converted into a bit vector, with exactly one bit set for each value.

  2. An induced subgraph is a subset of a graph's nodes together with the edges that connect them in the original graph.

  3. Linking two activities with a common entity is now supported by PROV Constraint 33.

  4. Our dataset is available for download at https://github.com/alawinia/provClustering.
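
The definitions in notes 1 and 2 can be illustrated with a minimal Python sketch. It is not taken from the paper: the stage labels (`trim`, `align`, `count`), the edge list, and the helper names `one_hot` and `induced_subgraph` are hypothetical, chosen only to show what each term means.

```python
from typing import FrozenSet, Iterable, List, Tuple

Edge = Tuple[str, str]


def one_hot(value: str, categories: List[str]) -> List[int]:
    """Encode a categorical value as a bit vector with exactly one bit set."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if category == value else 0 for category in categories]


def induced_subgraph(
    edges: Iterable[Edge], nodes: Iterable[str]
) -> Tuple[FrozenSet[str], List[Edge]]:
    """Keep the chosen nodes and every original edge whose endpoints are both kept."""
    kept_nodes = frozenset(nodes)
    kept_edges = [(u, v) for (u, v) in edges if u in kept_nodes and v in kept_nodes]
    return kept_nodes, kept_edges


# Hypothetical workflow stage labels (assumed for illustration only).
stage_labels = ["align", "trim", "count"]
print(one_hot("trim", stage_labels))

# Hypothetical directed edges between stages (assumed for illustration only).
workflow_edges = [("trim", "align"), ("align", "count"), ("trim", "count")]
sub_nodes, sub_edges = induced_subgraph(workflow_edges, ["trim", "align"])
print(sorted(sub_nodes), sub_edges)
```

In this sketch, restricting the graph to `trim` and `align` keeps only the edge between those two nodes; edges that touch `count` are dropped because one endpoint lies outside the selected subset.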

Author information

Corresponding author

Correspondence to Abdussalam Alawini.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Alawini, A., Chen, L., Davidson, S., Fisher, S., Kim, J. (2018). Discovering Similar Workflows via Provenance Clustering: A Case Study. In: Belhajjame, K., Gehani, A., Alper, P. (eds) Provenance and Annotation of Data and Processes. IPAW 2018. Lecture Notes in Computer Science, vol 11017. Springer, Cham. https://doi.org/10.1007/978-3-319-98379-0_9

  • DOI: https://doi.org/10.1007/978-3-319-98379-0_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-98378-3

  • Online ISBN: 978-3-319-98379-0

  • eBook Packages: Computer Science, Computer Science (R0)