A Model and System for Querying Provenance from Data Cleaning Workflows

  • Conference paper
  • In: Provenance and Annotation of Data and Processes (IPAW 2020, IPAW 2021)

Abstract

Data cleaning is an essential component of data preparation in machine learning and other data science workflows, and is widely recognized as the most time-consuming and error-prone part of working with real-world data. How data was prepared and cleaned has a significant impact on the reliability and trustworthiness of the results of any subsequent analysis. Transparent data cleaning requires not only that provenance (i.e., operation history and value changes) be captured, but also that those changes be easy to explore and evaluate: the data scientists who prepare the data, as well as others who want to reuse the cleaned data for their own studies, need to be able to explore and query its cleaning history easily. We have developed a domain-specific provenance model for data cleaning that supports the kinds of provenance questions that data scientists need to answer when inspecting and debugging data preparation histories. The design of the model was driven by the need (i) to answer relevant, user-oriented provenance questions, and (ii) to do so effectively and efficiently. The model is a refinement of an earlier provenance model and has been implemented as a companion tool to OpenRefine, a popular open-source tool for data cleaning.
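
As an illustration of what querying "operation history and value changes" might look like, here is a minimal sketch using Python's built-in sqlite3 module. It is not the schema of the model described above: the tables `operation` and `value_change`, the function `cell_history`, and all sample values are hypothetical and made up for this example.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical provenance schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE operation (
            step INTEGER PRIMARY KEY,   -- position in the cleaning history
            name TEXT NOT NULL          -- human-readable operation name
        );
        CREATE TABLE value_change (
            step      INTEGER NOT NULL REFERENCES operation(step),
            row_id    INTEGER NOT NULL,
            col       TEXT NOT NULL,
            old_value TEXT,
            new_value TEXT
        );
    """)
    # Made-up sample history: three operations touching two cells.
    conn.executemany(
        "INSERT INTO operation VALUES (?, ?)",
        [(1, "trim whitespace"), (2, "to titlecase"), (3, "cluster and merge")],
    )
    conn.executemany(
        "INSERT INTO value_change VALUES (?, ?, ?, ?, ?)",
        [
            (1, 0, "sponsor", " hotel astor ", "hotel astor"),
            (2, 0, "sponsor", "hotel astor", "Hotel Astor"),
            (2, 1, "sponsor", "waldorf", "Waldorf"),
            (3, 1, "sponsor", "Waldorf", "Waldorf Astoria"),
        ],
    )
    return conn


def cell_history(conn, row_id, col):
    """Return every recorded change to one cell, oldest first."""
    cur = conn.execute(
        """
        SELECT vc.step, op.name, vc.old_value, vc.new_value
        FROM value_change AS vc
        JOIN operation AS op ON op.step = vc.step
        WHERE vc.row_id = ? AND vc.col = ?
        ORDER BY vc.step
        """,
        (row_id, col),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    for step, name, old, new in cell_history(conn, 0, "sponsor"):
        print(f"step {step} ({name}): {old!r} -> {new!r}")
```

Keeping one row per changed cell, rather than one row per operation, is what lets a question such as "which operations touched this cell?" be answered with a single join in this sketch.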

Notes

  1. The old adage of “garbage in, garbage out” comes to mind.

  2. Roughly speaking, data is of good quality if it is fit for purpose [19].

  3. en.wikipedia.org/wiki/Videocassette_recorder.

  4. In ML and statistics, columns and rows often represent features (or variables) and observations, respectively.

  5. This query can also be reused to reconstruct dataset snapshots if needed.
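
One way a snapshot reconstruction of this kind could work is sketched below. It is not the query the note refers to. It assumes a hypothetical table `cell_change` that stores the imported values as step 0 and every later change with its step number, so that the snapshot at step k is the latest recorded value of each cell at or before k. The table, the functions `build_change_log` and `snapshot_at`, and the sample values are all illustrative.

```python
import sqlite3


def build_change_log():
    """Create a hypothetical per-cell change log in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE cell_change (
            step   INTEGER NOT NULL,  -- 0 = value as imported
            row_id INTEGER NOT NULL,
            col    TEXT NOT NULL,
            value  TEXT,
            PRIMARY KEY (step, row_id, col)
        )
    """)
    # Made-up sample data: two cells, cleaned over two steps.
    conn.executemany(
        "INSERT INTO cell_change VALUES (?, ?, ?, ?)",
        [
            (0, 0, "name", " alice "),
            (0, 1, "name", "BOB"),
            (1, 0, "name", "alice"),   # step 1 only touches row 0
            (2, 0, "name", "Alice"),
            (2, 1, "name", "Bob"),
        ],
    )
    return conn


def snapshot_at(conn, step):
    """Rebuild the dataset as it looked after the given step.

    For each cell, pick the change with the largest step number
    that is not later than the requested step.
    """
    rows = conn.execute(
        """
        SELECT c.row_id, c.col, c.value
        FROM cell_change AS c
        WHERE c.step = (
            SELECT MAX(c2.step)
            FROM cell_change AS c2
            WHERE c2.row_id = c.row_id
              AND c2.col = c.col
              AND c2.step <= ?
        )
        ORDER BY c.row_id, c.col
        """,
        (step,),
    ).fetchall()
    snapshot = {}
    for row_id, col, value in rows:
        snapshot.setdefault(row_id, {})[col] = value
    return snapshot


if __name__ == "__main__":
    conn = build_change_log()
    for k in range(3):
        print(k, snapshot_at(conn, k))
```

Because only changed cells are stored, the log in this sketch stays small, at the cost of a correlated subquery whenever a full snapshot is needed.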

References

  1. Belhajjame, K., et al.: PROV-DM: the PROV data model. www.w3.org/TR/prov-dm (2012)

  2. Clingo: A grounder and solver for logic programs. https://github.com/potassco/clingo

  3. Cuevas-Vicenttín, V., et al.: ProvONE: a PROV extension data model for scientific workflow provenance (2016). http://jenkins-1.dataone.org/jenkins/view/DocumentationProjects/job/ProvONE-Documentation-trunk/ws/provenance/ProvONE/v1/provone.html

  4. Dey, S.C., Köhler, S., Bowers, S., Ludäscher, B.: Datalog as a Lingua Franca for Provenance Querying and Reasoning. In: Workshop on Theory and Practice of Provenance (TaPP) (2012)

  5. Gebser, M., Kaminski, R., Kaufmann, B., Schaub, T.: Multi-shot ASP solving with clingo. CoRR arXiv:1705.09811 (2017)

  6. Hipp, R.: SQLite (2021). www.sqlite.org

  7. Li, L., Parulian, N., Ludäscher, B.: or2yw: generating YesWorkflow models from OpenRefine histories (2021). https://github.com/idaks/OR2YWTool

  8. McPhillips, T., Bowers, S., Belhajjame, K., Ludäscher, B.: Retrospective provenance without a runtime provenance recorder. In: Theory and Practice of Provenance (TaPP) (2015). https://doi.org/10.5555/2814579.2814580

  9. McPhillips, T., Li, L., Parulian, N., Ludäscher, B.: Modeling provenance and understanding reproducibility for OpenRefine data cleaning workflows. In: Workshop on Theory and Practice of Provenance (TaPP) (2019)

  10. Missier, P., Dey, S., Belhajjame, K., Cuevas-Vicenttín, V., Ludäscher, B.: D-PROV: extending the PROV provenance model with workflow structure. In: Workshop on the Theory and Practice of Provenance (TaPP) (2013)

  11. Moreau, L., et al.: The open provenance model core specification. Future Gener. Comput. Syst. 27(6), 743–756 (2011)

  12. New York Public Library: What’s on the menu? (2020). http://menus.nypl.org

  13. Oliveira, W., Missier, P., de Oliveira, D., Braganholo, V.: Comparing provenance data models for scientific workflows: an analysis of PROV-Wf and ProvONE. In: Anais do Brazilian e-Science Workshop (BreSci), pp. 9–16, January 2020

  14. Omitola, T., Freitas, A., Curry, E., O’Riain, S., Gibbins, N., Shadbolt, N.: Capturing interactive data transformation operations using provenance workflows. In: Simperl, E., et al. (eds.) ESWC 2012. LNCS, vol. 7540, pp. 29–42. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-46641-4_3

  15. OpenRefine: A free, open source, power tool for working with messy data (2021). https://github.com/OpenRefine

  16. Pandas: powerful Python data analysis toolkit (2019). https://github.com/pandas-dev/pandas

  17. Parulian, N.: OpenRefine Provenance Explorer (ORPE) Data Cleaning Model (DCM) (2021). https://github.com/idaks/IPAW2021-ORPE

  18. Pimentel, J.F., et al.: Yin & Yang: demonstrating complementary provenance from noWorkflow & YesWorkflow. In: IPAW. LNCS, vol. 9672. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-40593-3_13

  19. Sadiq, S.: Handbook of Data Quality. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-36257-6

  20. Winslett, M., Braganholo, V.: Richard Hipp speaks out on SQLite. ACM SIGMOD Record 48(2), 39–46 (2019)

Author information

Corresponding author

Correspondence to Nikolaus Nova Parulian.

Appendix A Sample Provenance Query Output

The following is a log of outputs for a set of SQLite demo queries:

[Figure o: output log of the SQLite demo queries]

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Parulian, N.N., McPhillips, T.M., Ludäscher, B. (2021). A Model and System for Querying Provenance from Data Cleaning Workflows. In: Glavic, B., Braganholo, V., Koop, D. (eds) Provenance and Annotation of Data and Processes. IPAW 2020, IPAW 2021. Lecture Notes in Computer Science, vol 12839. Springer, Cham. https://doi.org/10.1007/978-3-030-80960-7_11

  • DOI: https://doi.org/10.1007/978-3-030-80960-7_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-80959-1

  • Online ISBN: 978-3-030-80960-7

  • eBook Packages: Computer Science, Computer Science (R0)
