A Model and System for Querying Provenance from Data Cleaning Workflows

Parulian, Nikolaus Nova; McPhillips, Timothy M.; Ludäscher, Bertram

doi:10.1007/978-3-030-80960-7_11

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12839)

Abstract

Data cleaning is an essential component of data preparation in machine learning and other data science workflows, and is widely recognized as the most time-consuming and error-prone part when working with real-world data. How data was prepared and cleaned has a significant impact on the reliability and trustworthiness of results of any subsequent analysis. Transparent data cleaning not only requires that provenance (i.e., operation history and value changes) be captured, but also that those changes are easy to explore and evaluate: The data scientists who prepare the data, as well as others who want to reuse the cleaned data for their studies, need to be able to easily explore and query its data cleaning history. We have developed a domain-specific provenance model for data cleaning that supports the kind of provenance questions that data scientists need to answer when inspecting and debugging data preparation histories. The design of the model was driven by the need (i) to answer relevant, user-oriented provenance questions, and (ii) to do so in an effective and efficient manner. The model is a refinement of an earlier provenance model and has been implemented as a companion tool to OpenRefine, a popular, open source tool for data cleaning.
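
As a rough illustration of the kind of provenance question described above (what happened to a given cell, and which operation changed it), the following minimal sketch stores an operation history and value changes in SQLite and queries the history of one cell. The table names (`operation`, `value_change`), their columns, and the sample data are hypothetical and chosen only for this example; they are not the schema of the authors' model.

```python
import sqlite3

# Hypothetical schema for illustration only; NOT the authors' data cleaning model.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE operation (step INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE value_change (
    step INTEGER REFERENCES operation(step),
    row_id INTEGER,
    col TEXT,
    old_value TEXT,
    new_value TEXT
);
""")

# Made-up operation history: two cleaning steps applied to a "city" column.
conn.executemany(
    "INSERT INTO operation VALUES (?, ?)",
    [(1, "trim whitespace"), (2, "to uppercase")],
)
conn.executemany(
    "INSERT INTO value_change VALUES (?, ?, ?, ?, ?)",
    [
        (1, 0, "city", " chicago ", "chicago"),
        (2, 0, "city", "chicago", "CHICAGO"),
        (2, 1, "city", "urbana", "URBANA"),
    ],
)


def cell_history(conn, row_id, col):
    """Return (step, operation name, old value, new value) for one cell, in step order."""
    return conn.execute(
        """
        SELECT c.step, o.name, c.old_value, c.new_value
        FROM value_change AS c
        JOIN operation AS o ON o.step = c.step
        WHERE c.row_id = ? AND c.col = ?
        ORDER BY c.step
        """,
        (row_id, col),
    ).fetchall()


history = cell_history(conn, 0, "city")
for step, name, old, new in history:
    print(f"step {step} ({name}): {old!r} -> {new!r}")
```

Recording each change together with the step that caused it keeps such a query to a single join; a real model would likely also need to track structural changes (for example, added or removed columns), which this sketch ignores.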

Notes

1. The old adage of “garbage in, garbage out” comes to mind.
2. Roughly speaking, data is of good quality if it is fit for purpose [19].
3. en.wikipedia.org/wiki/Videocassette_recorder.
4. In ML and statistics, columns and rows often represent features (or variables) and observations, respectively.
5. This query can also be reused to reconstruct dataset snapshots if needed.

References

1. Belhajjame, K., et al.: PROV-DM: the PROV data model. www.w3.org/TR/prov-dm (2012)
2. Clingo: A grounder and solver for logic programs. https://github.com/potassco/clingo
3. Cuevas-Vicenttín, V., et al.: ProvONE: a PROV extension data model for scientific workflow provenance (2016). http://jenkins-1.dataone.org/jenkins/view/DocumentationProjects/job/ProvONE-Documentation-trunk/ws/provenance/ProvONE/v1/provone.html
4. Dey, S.C., Köhler, S., Bowers, S., Ludäscher, B.: Datalog as a Lingua Franca for Provenance Querying and Reasoning. In: Workshop on Theory and Practice of Provenance (TaPP) (2012)
5. Gebser, M., Kaminski, R., Kaufmann, B., Schaub, T.: Multi-shot ASP solving with clingo. CoRR arXiv:1705.09811 (2017)
6. Hipp, R.: SQLite (2021). www.sqlite.org
7. Li, L., Parulian, N., Ludäscher, B.: or2yw: generating YesWorkflow models from OpenRefine histories (2021). https://github.com/idaks/OR2YWTool
8. McPhillips, T., Bowers, S., Belhajjame, K., Ludäscher, B.: Retrospective provenance without a runtime provenance recorder. In: Theory and Practice of Provenance (TaPP) (2015). https://doi.org/10.5555/2814579.2814580
9. McPhillips, T., Li, L., Parulian, N., Ludäscher, B.: Modeling provenance and understanding reproducibility for OpenRefine data cleaning workflows. In: Workshop on Theory and Practice of Provenance (TaPP) (2019)
10. Missier, P., Dey, S., Belhajjame, K., Cuevas-Vicenttín, V., Ludäscher, B.: D-PROV: extending the PROV provenance model with workflow structure. In: Workshop on the Theory and Practice of Provenance (TaPP) (2013)
11. Moreau, L., et al.: The open provenance model core specification. Future Gener. Comput. Syst. 27(6), 743–756 (2011)
12. New York Public Library: What’s on the menu? (2020). http://menus.nypl.org
13. Oliveira, W., Missier, P., de Oliveira, D., Braganholo, V.: Comparing provenance data models for scientific workflows: an analysis of PROV-Wf and ProvONE. In: Anais do Brazilian e-Science Workshop (BreSci), pp. 9–16, January 2020
14. Omitola, T., Freitas, A., Curry, E., O’Riain, S., Gibbins, N., Shadbolt, N.: Capturing interactive data transformation operations using provenance workflows. In: Simperl, E., et al. (eds.) ESWC 2012. LNCS, vol. 7540, pp. 29–42. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-46641-4_3
15. OpenRefine: A free, open source, power tool for working with messy data (2021). https://github.com/OpenRefine
16. Pandas: powerful Python data analysis toolkit (2019). https://github.com/pandas-dev/pandas
17. Parulian, N.: OpenRefine Provenance Explorer (ORPE) Data Cleaning Model (DCM) (2021). https://github.com/idaks/IPAW2021-ORPE
18. Pimentel, J.F., et al.: Yin & Yang: demonstrating complementary provenance from noWorkflow & YesWorkflow. In: IPAW. LNCS, vol. 9672. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-40593-3_13
19. Sadiq, S.: Handbook of Data Quality. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-36257-6
20. Winslett, M., Braganholo, V.: Richard Hipp speaks out on SQLite. ACM SIGMOD Record 48(2), 39–46 (2019)

Author information

Authors and Affiliations

School of Information Sciences, University of Illinois at Urbana-Champaign, Champaign, USA
Nikolaus Nova Parulian, Timothy M. McPhillips & Bertram Ludäscher

Corresponding author

Correspondence to Nikolaus Nova Parulian.

Editor information

Editors and Affiliations

Illinois Institute of Technology, Chicago, IL, USA
Boris Glavic
Fluminense Federal University, Niterói, Brazil
Vanessa Braganholo
Northern Illinois University, DeKalb, IL, USA
David Koop

Appendix A Sample Provenance Query Output

The following is a log of outputs for a set of SQLite demo-queries:

Cite this paper

Parulian, N.N., McPhillips, T.M., Ludäscher, B. (2021). A Model and System for Querying Provenance from Data Cleaning Workflows. In: Glavic, B., Braganholo, V., Koop, D. (eds) Provenance and Annotation of Data and Processes. IPAW 2020, IPAW 2021. Lecture Notes in Computer Science, vol 12839. Springer, Cham. https://doi.org/10.1007/978-3-030-80960-7_11

DOI: https://doi.org/10.1007/978-3-030-80960-7_11
Published: 09 July 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-80959-1
Online ISBN: 978-3-030-80960-7
eBook Packages: Computer Science, Computer Science (R0)
