Elastic distributed training with fast convergence and efficient resource utilization

Cong, Guojing

doi:10.1109/ICMLA52953.2021.00160

Conference · Wed Dec 01 00:00:00 EST 2021

DOI: https://doi.org/10.1109/ICMLA52953.2021.00160 · OSTI ID: 1843691

Cong, Guojing (ORNL)

Distributed learning is now routinely conducted in the cloud as well as on dedicated clusters. Training with elastic resources brings new challenges and design choices. Prior studies focus on runtime performance and assume static algorithmic behavior. In this work, by analyzing the impact of resource scaling on convergence, we introduce schedules for synchronous stochastic gradient descent that proactively adapt the number of learners to reduce training time and improve convergence. Our approach no longer assumes a constant number of processors throughout training. In our experiment, distributed stochastic gradient descent with dynamic schedules and reduction momentum achieves better convergence and significant speedups over prior static schedules. Numerous distributed training jobs running in the cloud may benefit from our approach.
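The abstract does not spell out the schedules themselves, so the following is only a minimal sketch of the general idea of synchronous stochastic gradient descent with a learner count that changes during training. Everything in it is an assumption made for illustration: the doubling rule in `learner_schedule`, the toy one-parameter regression problem, and all names and default values are hypothetical and are not taken from the paper. Reduction momentum is not modeled.

```python
import random


def learner_schedule(step, start=2, cap=16, double_every=10):
    """Hypothetical schedule: double the learner count every
    `double_every` steps, never exceeding `cap`."""
    return min(cap, start * 2 ** (step // double_every))


def minibatch_gradient(w, batch):
    """Gradient of the mean of 0.5 * (w*x - y)^2 over one minibatch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def make_data(n=256, true_w=3.0, seed=1):
    """Toy data set: y = true_w * x plus a little Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, true_w * x + rng.gauss(0.0, 0.01)) for x in xs]


def train(data, steps=40, batch_size=8, lr=0.5, seed=0):
    """Simulated synchronous SGD: at each step every learner computes a
    gradient on its own minibatch, the gradients are averaged (standing
    in for an allreduce), and a single shared update is applied."""
    rng = random.Random(seed)
    w = 0.0
    history = []
    for step in range(steps):
        p = learner_schedule(step)  # number of learners at this step
        grads = [
            minibatch_gradient(w, rng.sample(data, batch_size))
            for _ in range(p)
        ]
        w -= lr * sum(grads) / p
        history.append((step, p, w))
    return w, history


if __name__ == "__main__":
    final_w, history = train(make_data())
    for step, p, w in history[::10]:
        print(f"step={step:2d} learners={p:2d} w={w:.4f}")
    print(f"final w = {final_w:.4f}")
```

In this sketch, adding learners enlarges the effective batch (learners times per-learner batch size), which lowers the variance of the averaged gradient as training proceeds. A real elastic schedule would also have to account for communication cost and the interaction between learner count and learning rate, which this toy example ignores.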

Research Organization: Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE

DOE Contract Number: AC05-00OR22725

OSTI ID: 1843691

Resource Relation: Conference: International Conference on Machine Learning and Applications (ICMLA) - remote, California, United States of America - 12/13/2021 10:00:00 AM-12/16/2021 5:00:00 AM

Country of Publication: United States

Language: English

Similar Records

Deep Generative Models that Solve PDEs: Distributed Computing for Training Large Data-Free Models

Conference · Sun Nov 01 00:00:00 EDT 2020 · Workshop on Machine Learning in HPC Environments (Online) · OSTI ID:1843691

Botelho, Sergio; Joshi, Ameya; Khara, Biswajit; +5 more

Adaptive elasticity policies for staging-based in situ visualization

Journal Article · Wed Dec 21 00:00:00 EST 2022 · Future Generations Computer Systems · OSTI ID:1843691

Wang, Zhe; Dorier, Matthieu; Subedi, Pradeep; +2 more

High-Level Synthesis of Parallel Specifications Coupling Static and Dynamic Controllers

Conference · Mon May 17 00:00:00 EDT 2021 · OSTI ID:1843691

Castellana, Vito G.; Tumeo, Antonino; Ferrandi, Fabrizio
