research-article

On the Anonymization of Workflow Provenance without Compromising the Transparency of Lineage

Published: 23 December 2021

Abstract

Workflows have been adopted in several scientific fields as a tool for specifying and executing scientific experiments. In addition to automating the execution of experiments, workflow systems often record provenance information, which includes, among other things, the data records used and generated by the workflow as a whole and by each of its component modules. Provenance information is widely recognized as useful for interpreting, verifying, and reusing workflow results, which justifies sharing and publishing it among scientists. However, workflows in some branches of science manipulate sensitive datasets that contain information about individuals, so publishing their provenance can disclose that information. To address this issue, we investigate in this article the problem of anonymizing workflow provenance. We consider a popular class of workflows in which each module invocation uses and generates collections of data records rather than a single data record. The solution we propose offers confidentiality guarantees without compromising lineage information, which provides transparency about the relationships between the data records used and generated by the workflow modules. We provide algorithmic solutions that show how the provenance of a single module and of an entire workflow can be anonymized, and we present the results of the experiments we conducted to evaluate them.
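To make the idea of anonymizing records while keeping lineage intact more concrete, the following is a minimal sketch and not the algorithm proposed in the article. It assumes a k-anonymity style generalization of one quasi-identifier. The record layout, the field names (`age`, `diagnosis`), the record identifiers, and the function names are all hypothetical. Lineage is expressed through record identifiers only, so generalizing attribute values does not alter it.

```python
from collections import Counter

def generalize_age(age, width):
    """Map an exact age to a range label such as '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def anonymize_records(records, k, widths=(5, 10, 20, 50, 100)):
    """Coarsen the hypothetical 'age' attribute until every range holds >= k records.

    Record ids are left untouched, so lineage edges that refer to them stay valid.
    """
    for width in widths:
        labels = {rid: generalize_age(rec["age"], width) for rid, rec in records.items()}
        counts = Counter(labels.values())
        if all(count >= k for count in counts.values()):
            return {rid: {**rec, "age": labels[rid]} for rid, rec in records.items()}
    # No width satisfies k: suppress the attribute entirely.
    return {rid: {**rec, "age": "*"} for rid, rec in records.items()}

# Hypothetical input collection of one module invocation.
inputs = {
    "i1": {"age": 23, "diagnosis": "flu"},
    "i2": {"age": 27, "diagnosis": "asthma"},
    "i3": {"age": 41, "diagnosis": "flu"},
    "i4": {"age": 45, "diagnosis": "diabetes"},
}
# Hypothetical lineage: each output record id maps to the input ids it was derived from.
lineage = {"o1": ["i1", "i2"], "o2": ["i3", "i4"]}

anonymized = anonymize_records(inputs, k=2)
```

With `k=2`, the sketch settles on ten-year ranges, and every identifier listed in `lineage` still resolves to a record in `anonymized`. A real solution would also have to anonymize the output collections and ensure that the groupings chosen on both sides of a module remain consistent with the lineage relationships between them, which this sketch does not attempt.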

Published In

Journal of Data and Information Quality, Volume 14, Issue 1
March 2022
61 pages
ISSN: 1936-1955
EISSN: 1936-1963
DOI: 10.1145/3505184

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 December 2021
Accepted: 01 April 2021
Revised: 01 February 2021
Received: 01 December 2020
Published in JDIQ Volume 14, Issue 1

Author Tags

  1. Provenance
  2. scientific workflows
  3. privacy
  4. transparency

Qualifiers

  • Research-article
  • Refereed
