Strategies for Integrating Deep Learning Surrogate Models with HPC Simulation Applications
- ORNL
The convergence of high performance computing (HPC), machine learning/deep learning (ML/DL), and big data analytics presents a host of challenges for large-scale computing campaigns that seek best practices for interleaving traditional simulation-based scientific workloads with ML/DL models. A portfolio of systematic approaches for incorporating deep learning into modeling and simulation meets a vital need for computing facilities that support AI for science. In this paper, we evaluate several strategies for deploying deep learning surrogate models in a representative physics application on supercomputers at the Oak Ridge Leadership Computing Facility (OLCF). We discuss a set of recommended deployment architectures and implementation approaches. We analyze and evaluate these alternatives and show their performance and scalability up to 1000 GPUs on two mainstream platforms equipped with different deep learning hardware and software stacks.
- Research Organization:
- Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- DOE Contract Number:
- AC05-00OR22725
- OSTI ID:
- 1885297
- Resource Relation:
- Conference: ExSAIS 2022: Workshop on Extreme Scaling of AI for Science, co-located with IPDPS 2022 - Lyons, France - 5/30/2022-6/3/2022
- Country of Publication:
- United States
- Language:
- English
Similar Records
Scalable training of graph convolutional neural networks for fast and accurate predictions of HOMO-LUMO gap in molecules
Characterizing Machine Learning I/O Workloads on Leadership Scale HPC Systems