Trust the Process: Analyzing Prospective Provenance for Data Cleaning

Authors:

Nikolaus Nova Parulian,

Bertram Ludäscher

WWW '23 Companion: Companion Proceedings of the ACM Web Conference 2023

Pages 1513 - 1523

https://doi.org/10.1145/3543873.3587558

Published: 30 April 2023

Abstract

In data-driven research and analysis, the quality of results largely depends on the quality of the data used. Data cleaning is a crucial step in improving data quality, but it is equally important to document the steps taken during cleaning to ensure transparency and enable others to assess the quality of the resulting data. While provenance models such as W3C PROV have been introduced to track changes and events related to any entity, using them to document the provenance of data-cleaning workflows can be challenging, particularly when different types of documents or entities are mixed in the model. To address this, we propose a conceptual model and analysis that breaks down data-cleaning workflows into process abstraction and workflow recipes, refining operations to the column level. This approach provides users with detailed provenance information, enabling transparency, auditing, and support for improving data-cleaning workflows. Our model has several features that allow static analysis, e.g., to determine the minimal input schema and expected output schema for running a recipe, to identify which steps violate the column schema requirement constraint, and to assess the reusability of a recipe on a new dataset. We hope that our model and analysis will contribute to making data processing more transparent, accessible, and reusable.
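The kind of static analysis mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' model or implementation: the `Step` record, its `reads`/`adds`/`drops` fields, and the three helper functions are hypothetical names chosen for illustration. The sketch assumes that each recipe step can be summarized by the columns it reads, creates, and removes, and it never touches the data itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One recipe operation, summarized by the columns it touches (hypothetical)."""
    name: str
    reads: frozenset = frozenset()   # columns that must exist before the step
    adds: frozenset = frozenset()    # columns the step creates
    drops: frozenset = frozenset()   # columns the step removes


def minimal_input_schema(recipe):
    """Return (required_columns, violations) by walking the recipe symbolically."""
    required = set()    # columns the input dataset must supply
    present = set()     # columns known to exist at the current step
    absent = set()      # columns removed by an earlier step
    violations = []     # (step name, column) pairs that break the schema
    for step in recipe:
        for col in sorted(step.reads | step.drops):
            if col in present:
                continue
            if col in absent:
                # The step needs a column that an earlier step removed.
                violations.append((step.name, col))
            else:
                # Never seen before, so it has to come from the input.
                required.add(col)
                present.add(col)
        present -= step.drops
        absent |= step.drops
        present |= step.adds
        absent -= step.adds
    return required, violations


def output_schema(recipe, input_columns):
    """Columns expected after running the recipe on the given input columns."""
    columns = set(input_columns)
    for step in recipe:
        columns = (columns - step.drops) | step.adds
    return columns


def reusable_on(recipe, input_columns):
    """A recipe is reusable if it is violation-free and its needs are met."""
    required, violations = minimal_input_schema(recipe)
    return not violations and required <= set(input_columns)


recipe = [
    Step("trim_name", reads=frozenset({"name"})),
    Step("split_date", reads=frozenset({"date"}),
         adds=frozenset({"year", "month"})),
    Step("drop_date", drops=frozenset({"date"})),
    Step("year_to_int", reads=frozenset({"year"})),
]

required, violations = minimal_input_schema(recipe)
print(sorted(required))                                   # ['date', 'name']
print(violations)                                         # []
print(sorted(output_schema(recipe, ["name", "date", "city"])))
print(reusable_on(recipe, ["name", "city"]))              # False
```

Appending a step that reads `date` after `drop_date` would be reported as a violation, because the column no longer exists at that point in the recipe.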

Index Terms

Trust the Process: Analyzing Prospective Provenance for Data Cleaning
1. Information systems
  1. Information systems applications
    1. Data mining
      1. Data cleaning

Published In

WWW '23 Companion: Companion Proceedings of the ACM Web Conference 2023

April 2023

1567 pages

ISBN:9781450394192

DOI:10.1145/3543873

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

WWW '23

Sponsor:

SIGWEB

WWW '23: The ACM Web Conference 2023

April 30 - May 4, 2023

Austin, TX, USA

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
