Abstract
The “Self-Learning Data Foundation for AI” is an open-source platform that manages Machine Learning (ML) metadata in complex end-to-end pipelines and includes the intelligence to optimize data gradation, pipeline configuration, and compute performance. The work addresses several challenges: prioritizing data to reduce movement, tracking lineage to optimize complex ML pipelines, and enabling reproducibility and portability of data selection and ML model development. Off-the-shelf AI metadata management frameworks (such as MLflow or Weights & Biases) focus on fine-grained stage-level metadata and track only parts of the pipeline and its lineage. Our proposed software layer sits between ML workflows and pipelines on one side and storage and data access on the other. The first implementation of the Data Foundation is the Common Metadata Framework (CMF), which captures metadata automatically and tracks it alongside references to data artifacts and application code. Its git-like nature allows parallel model development by different teams and is well suited to federated environments. CMF includes intelligence to optimize pipelines and storage: it can learn access patterns from pipeline executions to inform optimizations such as prestaging and caching, and it learns from model inference metrics to iteratively build more robust models. Through a data shaping use case for I/O optimization and an active learning use case to reduce labelling effort (both on DeepCam AI model training on climate data running on NERSC Cori), we show the versatility of the data foundation layer, its potential benefits (a 4x reduction in training time and a 2x reduction in labelling effort), and its central role in complex ML pipelines.
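To make the metadata-capture workflow concrete, below is a minimal sketch of stage-level logging in the style of CMF's Python library (cmflib). The call names (cmf.Cmf, create_context, create_execution, log_dataset, log_model, log_execution_metrics) follow CMF's public examples, but exact signatures can vary across versions, and the pipeline name, file paths, parameters, and metric values here are hypothetical.

    # Minimal CMF logging sketch (illustrative names and values).
    from cmflib import cmf

    # One writer per pipeline; "filename" follows CMF's published examples
    # and may differ (e.g., "filepath") in other versions.
    metawriter = cmf.Cmf(filename="mlmd", pipeline_name="climate-seg")

    # A context groups all executions of one pipeline stage.
    metawriter.create_context(pipeline_stage="train")

    # An execution records a single run of the stage with its parameters.
    metawriter.create_execution(
        execution_type="train-deepcam",
        custom_properties={"learning_rate": 1e-3, "batch_size": 64},
    )

    # Inputs and outputs are logged as references (content hashes), not copies,
    # which is what enables lineage tracking across teams and sites.
    metawriter.log_dataset("data/climate_train.h5", "input")
    metawriter.log_model(path="models/deepcam.pt", event="output")

    # Per-execution metrics feed the self-learning loops (e.g., active learning).
    metawriter.log_execution_metrics("train_metrics", {"val_iou": 0.82})

Because only references and hashes travel with the metadata, different teams can develop models in parallel and later compare or merge lineage, much like branching in git.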
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Justine, A., et al. (2022). Self-learning Data Foundation for Scientific AI. In: Kothe, D., Geist, A., Pophale, S., Liu, H., Parete-Koon, S. (eds.) Accelerating Science and Engineering Discoveries Through Integrated Research Infrastructure for Experiment, Big Data, Modeling and Simulation. SMC 2022. Communications in Computer and Information Science, vol. 1690. Springer, Cham.
DOI: https://doi.org/10.1007/978-3-031-23606-8_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-23605-1
Online ISBN: 978-3-031-23606-8