Notebook Archaeology: Inferring Provenance from Computational Notebooks

  • Conference paper
Provenance and Annotation of Data and Processes (IPAW 2020, IPAW 2021)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12839)

Abstract

Computational notebooks allow users to persist code, results, and explanations together, making them important artifacts in understanding research. However, these notebooks often do not record the full provenance of results because steps can be repeated, reordered, or removed. This can lead to inconsistencies between what the authors found and recorded and what others see when they attempt to examine those results. Even so, notebooks do offer some clues that help us infer and understand what may have happened. This paper presents techniques to unearth patterns and develop hypotheses about how the original results were obtained. The work uses statistics from a large corpus of notebooks to build the probable provenance of a notebook’s state. Results show these techniques can help others understand notebooks that may have been archived without proper preservation.
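As a rough illustration of the kind of clue a saved notebook can retain, the sketch below reads the execution counts that Jupyter stores with each code cell and reports whether cells appear to have been run out of positional order and which counts no longer belong to any cell. It is not the paper's method, only a minimal example under stated assumptions: the notebook is nbformat 4 JSON, and the function name `execution_clues` and the sample notebook are hypothetical.

```python
import json


def execution_clues(notebook_json: str) -> dict:
    """Summarize execution-count clues in a Jupyter notebook (nbformat 4 JSON)."""
    nb = json.loads(notebook_json)
    # Only code cells carry an execution_count; None means the cell was not run
    # (or its count was cleared) before the notebook was saved.
    counts = [
        cell.get("execution_count")
        for cell in nb.get("cells", [])
        if cell.get("cell_type") == "code"
    ]
    executed = [c for c in counts if c is not None]
    # Strictly increasing counts are consistent with a top-to-bottom run.
    in_order = all(a < b for a, b in zip(executed, executed[1:]))
    # Counts up to the maximum that no surviving cell carries: each one hints
    # at a cell that was re-executed or removed after it ran.
    highest = max(executed, default=0)
    missing = sorted(set(range(1, highest + 1)) - set(executed))
    return {
        "positional_counts": counts,
        "executed_in_order": in_order,
        "missing_counts": missing,
        "unexecuted_cells": counts.count(None),
    }


# Hypothetical notebook: four code cells and one markdown cell.
example = json.dumps({
    "nbformat": 4,
    "cells": [
        {"cell_type": "code", "execution_count": 3, "source": "import pandas as pd"},
        {"cell_type": "markdown", "source": "Load the data"},
        {"cell_type": "code", "execution_count": 1, "source": "df = pd.read_csv('d.csv')"},
        {"cell_type": "code", "execution_count": None, "source": "df.head()"},
        {"cell_type": "code", "execution_count": 5, "source": "df.describe()"},
    ],
})
clues = execution_clues(example)
print(clues)
```

In this sample, the first cell carries a higher count than the second, so the cells were not run top to bottom, and counts 2 and 4 are missing, which suggests that some cells were re-executed or deleted. Such signals only support hypotheses about what happened; they do not recover the actual history.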

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant SBE-2022443.

Author information

Corresponding author

Correspondence to David Koop.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Koop, D. (2021). Notebook Archaeology: Inferring Provenance from Computational Notebooks. In: Glavic, B., Braganholo, V., Koop, D. (eds) Provenance and Annotation of Data and Processes. IPAW 2020, IPAW 2021. Lecture Notes in Computer Science, vol 12839. Springer, Cham. https://doi.org/10.1007/978-3-030-80960-7_7

  • DOI: https://doi.org/10.1007/978-3-030-80960-7_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-80959-1

  • Online ISBN: 978-3-030-80960-7

  • eBook Packages: Computer Science, Computer Science (R0)