
1 Introduction

Scientific experiments are complex processes that involve multiple agents, activities, computational environments, inputs, and outputs. Collaborative and distributed experimental environments pose additional challenges for code sharing and execution. The inspiration for our work arises from the Collaborative Research Center (CRC) ReceptorLight, where scientists from multiple research institutes collaborate to develop high-performance microscopy techniques to understand the function of membrane receptors. In such a distributed and collaborative environment, it is necessary to understand the provenance of the results generated by the researchers. In our previous work [9], we developed a prototype to capture the non-computational parts of an experiment, including the descriptions, agents, execution environment, devices, materials, and methods used. To capture the provenance of the computational part of an experiment, we use JupyterHub, a multi-user version of the Jupyter Notebook. It is an open-source initiative that supports centralized deployment and centralized user authentication and fosters collaboration among scientists, who can document and run code in many programming languages.

Several tools have been introduced to capture the provenance of computational experiments. Scientific workflow management systems capture provenance by running experiments as workflows, but because of their steep learning curve, scientists still prefer writing scripts [6]. This has motivated approaches that capture provenance from scripts and notebooks [2, 5, 7]. YesWorkflow [5] is a language-independent tool that renders the provenance data of a script as a workflow with the help of special annotations added by the user. Another recent approach converts notebooks into workflows, requiring notebook developers to follow a set of guidelines when writing code [2]. Carvalho et al. [1] present a methodology to convert scripts into workflow research objects with the help of tools like YesWorkflow, Research Objects, and PROV. All of these approaches share the limitation that the user must modify the scripts. Pimentel et al. [7] collect provenance data from IPython notebooks by integrating noWorkflow [6] into the notebooks; however, this approach is limited to Python scripts. In our work, we capture and semantically describe the provenance of notebook executions in a multi-user environment using the REPRODUCE-ME ontology [8], which extends PROV-O [4] and P-Plan [3], to give a complete picture of a scientific experiment.

2 Development

We use the REPRODUCE-ME ontology [8] to describe the provenance of a notebook and its execution. To do so, we extend P-Plan [3] to represent steps, plans, input and output variables, and their relationships with each other. The prefixes prov:, p-plan:, and repr: denote the namespaces of PROV-O, P-Plan, and REPRODUCE-ME, respectively. A p-plan:Plan consists of smaller steps (p-plan:Step), each of which consumes and produces variables (p-plan:Variable). repr:Experiment and repr:Notebook are subclasses of p-plan:Plan, and a repr:Notebook is related to its repr:Experiment using the object property p-plan:isSubPlanOfPlan. A cell of a notebook, a repr:Cell, is a p-plan:Step whose generated output is described as a p-plan:Variable; the source of the cell is described as an input variable. The creation time of a notebook is described using prov:generatedAtTime and the modification time using repr:modifiedAtTime. The execution order of cells is described using p-plan:isPrecededBy. repr:Session, a subclass of p-plan:Activity, describes the session of a notebook user, who is described using prov:Agent. The execution environment of a notebook is described using repr:Setting, which includes repr:ProgrammingLanguage, repr:Version, and repr:Kernel.
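As an illustration, the following SPARQL update sketches how a notebook with two cells could be described using these terms. It is a minimal sketch, not output of the prototype: the ex: instance IRIs are hypothetical, the repr: namespace IRI is an abbreviated placeholder, and p-plan:isStepOfPlan, p-plan:isInputVarOf, and p-plan:isOutputVarOf are the standard P-Plan properties for attaching steps and variables to a plan.

PREFIX prov:   <http://www.w3.org/ns/prov#>
PREFIX p-plan: <http://purl.org/net/p-plan#>
PREFIX repr:   <https://w3id.org/reproduceme#>
PREFIX ex:     <http://example.org/>
PREFIX xsd:    <http://www.w3.org/2001/XMLSchema#>

INSERT DATA {
  ex:experiment1 a repr:Experiment .              # a repr:Experiment is a p-plan:Plan
  ex:notebook1 a repr:Notebook ;                  # a repr:Notebook is a sub-plan of the experiment
    p-plan:isSubPlanOfPlan ex:experiment1 ;
    prov:generatedAtTime "2018-03-01T10:00:00"^^xsd:dateTime ;
    repr:modifiedAtTime  "2018-03-02T14:30:00"^^xsd:dateTime .
  ex:cell1 a repr:Cell ;                          # cells are p-plan:Steps of the notebook
    p-plan:isStepOfPlan ex:notebook1 .
  ex:cell2 a repr:Cell ;
    p-plan:isStepOfPlan ex:notebook1 ;
    p-plan:isPrecededBy ex:cell1 .                # cell2 was executed after cell1
  ex:cell2Source a p-plan:Variable ;              # the source of a cell is its input variable
    p-plan:isInputVarOf ex:cell2 .
  ex:cell2Output a p-plan:Variable ;              # the output generated by the cell
    p-plan:isOutputVarOf ex:cell2 .
}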

JupyterHub is installed and connected to our prototype so that users can create new notebooks, run them, and share them. The notebooks are stored in a central location so that they can be shared and run by scientists who belong to the same group. Our prototype fetches the metadata of the notebooks from the Jupyter Notebook and JupyterHub REST APIs, which provide details of the notebooks, the kernels, the programming languages used, the sessions, and the users. The metadata that is useful for scientists is stored along with the other experimental data. The captured provenance data is then mapped to the ontology using an ontology-based data access (OBDA) technique. The prototype provides a dashboard which runs SPARQL queries and visualizes the experimental data, including the people involved in the experiment, the devices and their settings, the publications used in the experiment, and the notebook data; a sketch of one such query is given below. Figure 1 shows the project dashboard in our prototype. In this way, the prototype provides a complete picture of an experiment. Listing 1.1 shows an example SPARQL query to find all the notebooks used in an experiment and their metadata.
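For example, a query over the captured session data could list which user ran which notebook in which session. This is a sketch under assumptions: prov:used, prov:wasAssociatedWith, and prov:startedAtTime are standard PROV-O properties, but whether the prototype's mappings link sessions to notebooks and users with exactly these properties is our assumption for illustration.

PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX repr: <https://w3id.org/reproduceme#>

SELECT ?notebook ?user ?start
WHERE {
  ?session a repr:Session ;          # a notebook session (a p-plan:Activity)
    prov:used ?notebook ;            # the notebook opened in the session (assumed linkage)
    prov:wasAssociatedWith ?user ;   # the agent who ran the session
    prov:startedAtTime ?start .      # when the session started
  ?notebook a repr:Notebook .
}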

Fig. 1. The project dashboard in the prototype [9]

Listing 1.1. An example SPARQL query to find all the notebooks used in an experiment and their metadata
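A minimal query of this shape, following the modeling described above, could look as follows; the ex: experiment IRI is illustrative, and the metadata properties selected here may differ from those in the published queries.

PREFIX prov:   <http://www.w3.org/ns/prov#>
PREFIX p-plan: <http://purl.org/net/p-plan#>
PREFIX repr:   <https://w3id.org/reproduceme#>
PREFIX ex:     <http://example.org/>

SELECT ?notebook ?created ?modified
WHERE {
  ?notebook a repr:Notebook ;                # all notebooks belonging to the experiment
    p-plan:isSubPlanOfPlan ex:experiment1 ;
    prov:generatedAtTime ?created ;          # creation time of the notebook
    repr:modifiedAtTime ?modified .          # last modification time
}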

The REPRODUCE-ME ontology, the mappings, and the SPARQL queries used for evaluation are publicly available.

3 Conclusion and Future Work

The REPRODUCE-ME ontology was initially developed for microscopy-based experiments. Since scientists use scripts to perform their data analyses, we expanded the ontology by extending W3C vocabularies to describe the widely used notebooks. In this paper, we semantically enrich scientific experimental data with the provenance of notebooks in the multi-user environment provided by JupyterHub. The prototype provides a dashboard which visualizes the experimental data along with the notebooks used to generate the final results. This allows users to trace the complete path taken by an experiment from its inputs to its outputs, along with the execution environment. As future work, we aim to evaluate the prototype with respect to scalability.