DOI: 10.1145/3543873.3587558
research-article

Trust the Process: Analyzing Prospective Provenance for Data Cleaning

Published: 30 April 2023

Abstract

In data-driven research and analysis, the quality of results largely depends on the quality of the data used. Data cleaning is a crucial step in improving data quality, but it is equally important to document the steps taken during the cleaning process to ensure transparency and enable others to assess the quality of the resulting data. While provenance models such as W3C PROV have been introduced to track changes and events related to any entity, using them to document the provenance of data-cleaning workflows can be challenging, particularly when mixing different types of documents or entities in the model. To address this, we propose a conceptual model and analysis that breaks down data-cleaning workflows into process abstraction and workflow recipes, refining operations to the column level. This approach provides users with detailed provenance information, enabling transparency, auditing, and support for improving data-cleaning workflows. Our model has several features that allow static analysis, e.g., to determine the minimal input schema and expected output schema for running a recipe, to identify which steps violate the column schema requirement constraint, and to assess the reusability of a recipe on a new dataset. We hope that our model and analysis will contribute to making data processing more transparent, accessible, and reusable.
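
To make the static analyses named in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a hypothetical representation in which each recipe step declares the columns it requires, adds, and removes; the names `Step`, `minimal_input_schema`, `analyze`, and the example `RECIPE` are illustrative only and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One column-level operation in a recipe (hypothetical representation)."""
    name: str
    requires: frozenset = frozenset()  # columns the step reads
    adds: frozenset = frozenset()      # columns the step creates
    removes: frozenset = frozenset()   # columns the step drops


def minimal_input_schema(recipe):
    """Columns that must exist in the input because no earlier step creates them."""
    needed, produced, dropped = set(), set(), set()
    for step in recipe:
        # A required column counts as an input requirement unless an earlier
        # step produced it, or dropped it (then no input could satisfy it).
        needed |= step.requires - produced - dropped
        produced = (produced - step.removes) | step.adds
        dropped = (dropped | step.removes) - step.adds
    return needed


def analyze(recipe, input_schema):
    """Walk the recipe over a schema without touching any data.

    Returns the expected output schema and a list of violations, each a
    tuple (step index, step name, sorted missing columns).
    """
    available = set(input_schema)
    violations = []
    for index, step in enumerate(recipe):
        missing = step.requires - available
        if missing:
            violations.append((index, step.name, sorted(missing)))
        available = (available - step.removes) | step.adds
    return available, violations


# Illustrative recipe: a rename is modeled as "remove old column, add new one".
RECIPE = [
    Step("trim name", requires=frozenset({"name"})),
    Step("rename name", requires=frozenset({"name"}),
         removes=frozenset({"name"}), adds=frozenset({"full_name"})),
    Step("split date", requires=frozenset({"date"}),
         adds=frozenset({"year", "month"})),
    Step("uppercase full_name", requires=frozenset({"full_name"})),
]

if __name__ == "__main__":
    print(sorted(minimal_input_schema(RECIPE)))
    print(analyze(RECIPE, {"id", "name", "date"}))
    print(analyze(RECIPE, {"id", "name"}))
```

Under these assumptions, the example recipe needs only the `name` and `date` columns as input. Applied to a dataset with columns `id` and `name`, the walk reports that the `split date` step is missing `date`, which is one simple way to judge whether a recipe can be reused on a new dataset.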


    Published In

    WWW '23 Companion: Companion Proceedings of the ACM Web Conference 2023
    April 2023
    1567 pages
    ISBN:9781450394192
    DOI:10.1145/3543873
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Data cleaning
    2. provenance
    3. provenance analysis
    4. transparency
    5. workflow

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    WWW '23: The ACM Web Conference 2023
    April 30 - May 4, 2023
    Austin, TX, USA

    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
