Abstract
Computational notebooks allow users to persist code, results, and explanations together, making them important artifacts in understanding research. However, these notebooks often do not record the full provenance of results because steps can be repeated, reordered, or removed. This can lead to inconsistencies between what the authors found and recorded, and what others see when they attempt to examine those results. Still, these notebooks do offer some clues that help us infer and understand what may have happened. This paper presents techniques to unearth patterns and develop hypotheses about how the original results were obtained. The work uses statistics from a large corpus of notebooks to build the probable provenance of a notebook’s state. Results show these techniques can help others understand notebooks that may have been archived without proper preservation.
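As an illustration of the kind of clue a saved notebook can retain, the sketch below reads the per-cell execution counts that Jupyter stores in its notebook JSON and reports cells that appear out of order, counts that are missing, and cells with no recorded execution. It is a minimal sketch and not the paper's technique: the function name execution_clues, the sample notebook, and the choice of these three signals are assumptions made for illustration.

```python
import json


def execution_clues(notebook_json: str) -> dict:
    """Summarize execution-order clues in a saved notebook (nbformat 4 layout).

    Hypothetical helper for illustration; not the method used in the paper.
    """
    nb = json.loads(notebook_json)

    # Execution counts of code cells, top to bottom.
    # None means the cell has no recorded execution (never run, or cleared).
    counts = [
        cell.get("execution_count")
        for cell in nb.get("cells", [])
        if cell.get("cell_type") == "code"
    ]
    executed = [c for c in counts if c is not None]

    # Adjacent executed cells whose counts decrease going down the page:
    # the lower cell was last run before the one above it.
    out_of_order = [(a, b) for a, b in zip(executed, executed[1:]) if b < a]

    # Counts absent from 1..max: executions whose cell was later re-run,
    # edited, or deleted, so the saved file no longer shows them.
    highest = max(executed, default=0)
    missing = sorted(set(range(1, highest + 1)) - set(executed))

    return {
        "counts": counts,
        "unexecuted": counts.count(None),
        "out_of_order": out_of_order,
        "missing": missing,
    }


# A made-up notebook: one markdown cell and four code cells.
sample = json.dumps({
    "nbformat": 4,
    "cells": [
        {"cell_type": "markdown", "source": "# Analysis"},
        {"cell_type": "code", "execution_count": 3, "source": "import pandas as pd"},
        {"cell_type": "code", "execution_count": 1, "source": "x = 10"},
        {"cell_type": "code", "execution_count": None, "source": "y = x + 1"},
        {"cell_type": "code", "execution_count": 5, "source": "print(x)"},
    ],
})

clues = execution_clues(sample)
```

For this sample, the second code cell carries a lower count than the first, and counts 2 and 4 do not appear in any saved cell, so the file alone cannot show what those executions did. Signals like these only suggest hypotheses about the session; they do not determine the original sequence of steps.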
Acknowledgements
This material is based upon work supported by the National Science Foundation under Grant SBE-2022443.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Koop, D. (2021). Notebook Archaeology: Inferring Provenance from Computational Notebooks. In: Glavic, B., Braganholo, V., Koop, D. (eds) Provenance and Annotation of Data and Processes. IPAW 2020, IPAW 2021. Lecture Notes in Computer Science, vol 12839. Springer, Cham. https://doi.org/10.1007/978-3-030-80960-7_7
DOI: https://doi.org/10.1007/978-3-030-80960-7_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-80959-1
Online ISBN: 978-3-030-80960-7
eBook Packages: Computer Science, Computer Science (R0)