Abstract
The “Self-Learning Data Foundation for AI” is an open-source platform that manages Machine Learning (ML) metadata in complex end-to-end pipelines and includes the intelligence to optimize data gradation, pipeline configuration, and compute performance. The work addresses several challenges: prioritizing data to reduce movement, tracking lineage to optimize complex ML pipelines, and enabling reproducibility and portability of data selection and ML model development. Off-the-shelf AI metadata management frameworks (such as MLflow or Weights & Biases) focus on fine-grained stage-level metadata and track only parts of the pipeline and its lineage. Our proposed software layer sits between ML workflows and pipelines on one side and storage and data access on the other. The first implementation of the Data Foundation is the Common Metadata Framework (CMF), which captures metadata automatically and tracks it alongside references to data artifacts and application code. Its git-like nature allows parallel model development by different teams and is well suited to federated environments. CMF includes intelligence to optimize pipelines and storage: it can learn access patterns from pipeline executions to inform optimizations such as prestaging and caching, and it learns from model inference metrics to iteratively build more robust models. Through a data shaping use case for I/O optimization and an active learning use case to reduce labelling effort (both on DeepCam AI model training on climate data running on NERSC Cori), we show the versatility of the data foundation layer, its potential benefits (a 4x reduction in training time and a 2x reduction in labelling effort), and its central role in complex ML pipelines.
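To make the metadata-capture workflow concrete, below is a minimal sketch of stage-level logging in the style of CMF's Python library (cmflib). The call names (cmf.Cmf, create_context, create_execution, log_dataset, log_model, log_execution_metrics) follow CMF's public examples, but exact signatures can vary across versions, and the pipeline name, file paths, parameters, and metric values here are hypothetical.

    # Minimal CMF logging sketch (illustrative names and values).
    from cmflib import cmf

    # One writer per pipeline; "filename" follows CMF's published examples
    # and may differ (e.g., "filepath") in other versions.
    metawriter = cmf.Cmf(filename="mlmd", pipeline_name="climate-seg")

    # A context groups all executions of one pipeline stage.
    metawriter.create_context(pipeline_stage="train")

    # An execution records a single run of the stage with its parameters.
    metawriter.create_execution(
        execution_type="train-deepcam",
        custom_properties={"learning_rate": 1e-3, "batch_size": 64},
    )

    # Inputs and outputs are logged as references (content hashes), not copies,
    # which is what enables lineage tracking across teams and sites.
    metawriter.log_dataset("data/climate_train.h5", "input")
    metawriter.log_model(path="models/deepcam.pt", event="output")

    # Per-execution metrics feed the self-learning loops (e.g., active learning).
    metawriter.log_execution_metrics("train_metrics", {"val_iou": 0.82})

Because only references and hashes travel with the metadata, different teams can develop models in parallel and later compare or merge lineage, much like branching in git.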
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Justine, A., et al. (2022). Self-learning Data Foundation for Scientific AI. In: Kothe, D., Geist, A., Pophale, S., Liu, H., Parete-Koon, S. (eds.) Accelerating Science and Engineering Discoveries Through Integrated Research Infrastructure for Experiment, Big Data, Modeling and Simulation. SMC 2022. Communications in Computer and Information Science, vol. 1690. Springer, Cham.
DOI: https://doi.org/10.1007/978-3-031-23606-8_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-23605-1
Online ISBN: 978-3-031-23606-8