Abstract
Data cleaning is an essential component of data preparation in machine learning and other data science workflows, and is widely recognized as the most time-consuming and error-prone part when working with real-world data. How data was prepared and cleaned has a significant impact on the reliability and trustworthiness of results of any subsequent analysis. Transparent data cleaning not only requires that provenance (i.e., operation history and value changes) be captured, but also that those changes are easy to explore and evaluate: The data scientists who prepare the data, as well as others who want to reuse the cleaned data for their studies, need to be able to easily explore and query its data cleaning history. We have developed a domain-specific provenance model for data cleaning that supports the kind of provenance questions that data scientists need to answer when inspecting and debugging data preparation histories. The design of the model was driven by the need (i) to answer relevant, user-oriented provenance questions, and (ii) to do so in an effective and efficient manner. The model is a refinement of an earlier provenance model and has been implemented as a companion tool to OpenRefine, a popular, open source tool for data cleaning.
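To make the notion of queryable cleaning provenance concrete, the following is a minimal sketch, not the model described in this paper. It assumes a hypothetical two-table SQLite schema (`operation` and `value_change`, both names invented for illustration) that records each operation and every cell value it changed, and it shows two user-oriented queries: the history of a single cell, and the reconstruction of a dataset snapshot as of a given step.

```python
import sqlite3

# Hypothetical schema (an assumption for illustration, not the paper's model):
# one row per operation, and one row per cell value change caused by it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE operation (step INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE value_change (
    step      INTEGER REFERENCES operation(step),
    row_id    INTEGER,
    col       TEXT,
    old_value TEXT,
    new_value TEXT
);
""")

# Made-up sample history: an import followed by two cleaning steps.
conn.executemany("INSERT INTO operation VALUES (?, ?)", [
    (0, "import"),
    (1, "trim whitespace"),
    (2, "to titlecase"),
])
conn.executemany("INSERT INTO value_change VALUES (?, ?, ?, ?, ?)", [
    (0, 1, "city", None, " new york "),
    (0, 2, "city", None, "Chicago"),
    (1, 1, "city", " new york ", "new york"),
    (2, 1, "city", "new york", "New York"),
])


def cell_history(conn, row_id, col):
    """Return every change to one cell, ordered by the step that made it."""
    return conn.execute(
        """
        SELECT o.step, o.name, v.old_value, v.new_value
        FROM value_change v JOIN operation o ON o.step = v.step
        WHERE v.row_id = ? AND v.col = ?
        ORDER BY o.step
        """,
        (row_id, col),
    ).fetchall()


def snapshot(conn, step):
    """Return {(row_id, col): value} holding each cell's latest value as of `step`."""
    rows = conn.execute(
        """
        SELECT v.row_id, v.col, v.new_value
        FROM value_change v
        WHERE v.step = (SELECT MAX(v2.step)
                        FROM value_change v2
                        WHERE v2.row_id = v.row_id
                          AND v2.col = v.col
                          AND v2.step <= ?)
        ORDER BY v.row_id, v.col
        """,
        (step,),
    ).fetchall()
    return {(r, c): value for r, c, value in rows}


history = cell_history(conn, 1, "city")
snap1 = snapshot(conn, 1)
```

In this sketch, `cell_history` answers "which operations changed this value, and how?", while `snapshot` replays the recorded changes up to a chosen step. Storing only the changes, rather than a full copy of the dataset per step, is one possible design choice; it keeps the history compact at the cost of a correlated subquery when a snapshot is rebuilt.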
Notes
- 1. The old adage of “garbage in, garbage out” comes to mind.
- 2. Roughly speaking, data is of good quality if it is fit for purpose [19].
- 3.
- 4. In ML and statistics, columns and rows often represent features (or variables) and observations, respectively.
- 5. This query can also be reused to reconstruct dataset snapshots if needed.
References
Belhajjame, K., et al.: PROV-DM: the PROV data model. www.w3.org/TR/prov-dm (2012)
Clingo: A grounder and solver for logic programs. https://github.com/potassco/clingo
Cuevas-Vicenttín, V., et al.: ProvONE: a PROV extension data model for scientific workflow provenance (2016). http://jenkins-1.dataone.org/jenkins/view/DocumentationProjects/job/ProvONE-Documentation-trunk/ws/provenance/ProvONE/v1/provone.html
Dey, S.C., Köhler, S., Bowers, S., Ludäscher, B.: Datalog as a Lingua Franca for Provenance Querying and Reasoning. In: Workshop on Theory and Practice of Provenance (TaPP) (2012)
Gebser, M., Kaminski, R., Kaufmann, B., Schaub, T.: Multi-shot ASP solving with clingo. CoRR arXiv:1705.09811 (2017)
Hipp, R.: SQLite (2021). www.sqlite.org
Li, L., Parulian, N., Ludäscher, B.: or2yw: generating YesWorkflow models from OpenRefine histories (2021). https://github.com/idaks/OR2YWTool
McPhillips, T., Bowers, S., Belhajjame, K., Ludäscher, B.: Retrospective provenance without a runtime provenance recorder. In: Theory and Practice of Provenance (TaPP) (2015). https://doi.org/10.5555/2814579.2814580
McPhillips, T., Li, L., Parulian, N., Ludäscher, B.: Modeling provenance and understanding reproducibility for OpenRefine data cleaning workflows. In: Workshop on Theory and Practice of Provenance (TaPP) (2019)
Missier, P., Dey, S., Belhajjame, K., Cuevas-Vicenttín, V., Ludäscher, B.: D-PROV: extending the PROV provenance model with workflow structure. In: Workshop on the Theory and Practice of Provenance (TaPP) (2013)
Moreau, L., et al.: The open provenance model core specification. Future Gener. Comput. Syst. 27(6), 743–756 (2011)
New York Public Library: What’s on the menu? (2020). http://menus.nypl.org
Oliveira, W., Missier, P., de Oliveira, D., Braganholo, V.: Comparing provenance data models for scientific workflows: an analysis of PROV-Wf and ProvONE. In: Anais do Brazilian e-Science Workshop (BreSci), pp. 9–16, January 2020
Omitola, T., Freitas, A., Curry, E., O’Riain, S., Gibbins, N., Shadbolt, N.: Capturing interactive data transformation operations using provenance workflows. In: Simperl, E., et al. (eds.) ESWC 2012. LNCS, vol. 7540, pp. 29–42. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-46641-4_3
OpenRefine: A free, open source, power tool for working with messy data (2021). https://github.com/OpenRefine
Pandas: powerful Python data analysis toolkit (2019). https://github.com/pandas-dev/pandas
Parulian, N.: OpenRefine Provenance Explorer (ORPE) Data Cleaning Model (DCM) (2021). https://github.com/idaks/IPAW2021-ORPE
Pimentel, J.F., et al.: Yin & Yang: demonstrating complementary provenance from noWorkflow & YesWorkflow. In: IPAW. LNCS, vol. 9672. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-40593-3_13
Sadiq, S.: Handbook of Data Quality. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-36257-6
Winslett, M., Braganholo, V.: Richard Hipp speaks out on SQLite. ACM SIGMOD Record 48(2), 39–46 (2019)
Appendix A Sample Provenance Query Output
The following is a log of outputs for a set of SQLite demo-queries:
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Parulian, N.N., McPhillips, T.M., Ludäscher, B. (2021). A Model and System for Querying Provenance from Data Cleaning Workflows. In: Glavic, B., Braganholo, V., Koop, D. (eds) Provenance and Annotation of Data and Processes. IPAW 2020, IPAW 2021. Lecture Notes in Computer Science, vol 12839. Springer, Cham. https://doi.org/10.1007/978-3-030-80960-7_11
DOI: https://doi.org/10.1007/978-3-030-80960-7_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-80959-1
Online ISBN: 978-3-030-80960-7
eBook Packages: Computer Science, Computer Science (R0)