Abstract

The “Self-Learning Data Foundation for AI” is an open-source platform for managing Machine Learning (ML) metadata in complex end-to-end pipelines, with built-in intelligence to optimize data gradation, pipeline configuration, and compute performance. The work addresses several challenges: prioritizing data to reduce movement, tracking lineage to optimize complex ML pipelines, and enabling reproducibility and portability of data selection and ML model development. Off-the-shelf AI metadata management frameworks (such as MLflow or Weights & Biases) focus on fine-grained stage-level metadata and track only parts of the pipeline and its lineage. Our proposed software layer sits between ML workflows and pipelines on one side and storage and data access on the other. The first implementation of the Data Foundation is the Common Metadata Framework (CMF), which captures and tracks metadata automatically, alongside references to data artifacts and application code. Its git-like nature enables parallel model development by different teams and is well suited to federated environments. It includes intelligence to optimize pipelines and storage: it can learn access patterns from pipeline executions to inform optimizations such as prestaging and caching, and it learns from model inference metrics to build iteratively more robust models. Through a data shaping use case for I/O optimization and an active learning use case to reduce labelling (DeepCam AI model training on climate data running on NERSC Cori), we show the versatility of the data foundation layer, its potential benefits (4x reduction in training time and 2x reduction in labelling effort), and its central role in complex ML pipelines.
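The abstract's core idea, git-like, stage-level capture of executions with content-addressed artifacts so lineage can be reconstructed across teams, can be illustrated with a minimal sketch. This is not CMF's actual API; the names (`MetadataStore`, `log_execution`, `lineage`) and the SHA-1 content addressing are assumptions chosen for illustration only:

```python
import hashlib
from dataclasses import dataclass, field

def content_hash(data: bytes) -> str:
    """Content-address an artifact, git-style, so identical data
    deduplicates across teams and sites."""
    return hashlib.sha1(data).hexdigest()

@dataclass
class MetadataStore:
    """Append-only record of pipeline-stage executions and their
    input/output artifacts, keyed by content hash."""
    executions: list = field(default_factory=list)

    def log_execution(self, stage, params, inputs, outputs):
        record = {
            "stage": stage,
            "params": params,
            "inputs": {n: content_hash(b) for n, b in inputs.items()},
            "outputs": {n: content_hash(b) for n, b in outputs.items()},
        }
        self.executions.append(record)
        return record

    def lineage(self, artifact_hash):
        """Walk backwards from an artifact to the chain of stages
        that (transitively) produced it."""
        chain = []
        frontier = {artifact_hash}
        for record in reversed(self.executions):
            if frontier & set(record["outputs"].values()):
                chain.append(record["stage"])
                frontier |= set(record["inputs"].values())
        return list(reversed(chain))
```

For example, after logging a `preprocess` stage that produces features and a `train` stage that consumes them, `lineage(content_hash(model_bytes))` would return `["preprocess", "train"]`, which is the kind of query that supports the reproducibility and pipeline-optimization goals described above.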

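The active learning use case (2x reduction in labelling effort) can likewise be sketched. The abstract does not state which acquisition function the authors use, so the sketch below assumes simple least-confidence uncertainty sampling; `least_confidence` and `select_batch` are hypothetical names, not part of CMF:

```python
def least_confidence(probs):
    """Uncertainty score in [0, 1): low when the model is sure,
    high when class probabilities are close to uniform."""
    return 1.0 - max(probs)

def select_batch(unlabelled, scorer, k):
    """Rank unlabelled samples by model uncertainty and return the
    k most uncertain ones, i.e. those most worth labelling next."""
    ranked = sorted(
        unlabelled,
        key=lambda sample: least_confidence(scorer(sample)),
        reverse=True,
    )
    return ranked[:k]
```

In an active learning loop, only the selected batch is sent to human annotators before retraining, which is how labelling effort shrinks relative to labelling the full pool.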


Author information

Corresponding author

Correspondence to Martin Foltin.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Justine, A. et al. (2022). Self-learning Data Foundation for Scientific AI. In: Kothe, D., Geist, A., Pophale, S., Liu, H., Parete-Koon, S. (eds) Accelerating Science and Engineering Discoveries Through Integrated Research Infrastructure for Experiment, Big Data, Modeling and Simulation. SMC 2022. Communications in Computer and Information Science, vol 1690. Springer, Cham. https://doi.org/10.1007/978-3-031-23606-8_2

  • DOI: https://doi.org/10.1007/978-3-031-23606-8_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-23605-1

  • Online ISBN: 978-3-031-23606-8

  • eBook Packages: Computer Science, Computer Science (R0)
