DOI: 10.1145/3322795.3331463

ElasticPipe: An Efficient and Dynamic Model-Parallel Solution to DNN Training

Published: 17 June 2019

Abstract

Traditional deep neural network (DNN) training is executed with data parallelism, which suffers from significant communication overhead and GPU memory consumption. Motivated by this, recent pioneering works have attempted to train DNNs with model parallelism. However, model partitioning remains a major concern, and a static partition fails to adapt to the ever-changing computing environment of a cloud cluster. This paper proposes ElasticPipe, which trains neural networks with pipe-based model parallelism. Unlike data-parallel solutions, each node in ElasticPipe holds only part of the whole model, leading to much lower communication cost and GPU memory consumption. More importantly, ElasticPipe can dynamically tune the workload distribution among nodes, so it mitigates the common straggler effect in cloud environments. Our preliminary experiments show that, compared to data-parallel baselines, ElasticPipe reduces training time by up to 89.03% when no stragglers are present, and by up to 76.72% when stragglers exist. Moreover, ElasticPipe outperforms its static counterpart by up to 28.81% in training performance when stragglers are involved.
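
To make the idea concrete, the following is a minimal, self-contained sketch (not the authors' implementation) of pipe-based model parallelism with dynamic repartitioning: each worker owns a contiguous slice of layers, micro-batches are streamed through the stages, and after each iteration a boundary layer is shifted away from a stage that is clearly overloaded (here, a simulated straggler). All names and constants (layer_cost, run_iteration, rebalance, the 3x slowdown, the 1.5x imbalance threshold) are illustrative assumptions, not details taken from the paper.

    # Illustrative sketch only (not the authors' code): pipe-based model parallelism
    # with dynamic repartitioning.  Each worker owns a contiguous slice of layers,
    # micro-batches flow through the stages, and one boundary layer is shifted away
    # from an overloaded stage between iterations.
    NUM_LAYERS = 12        # hypothetical model depth
    NUM_WORKERS = 4        # hypothetical number of pipeline stages
    MICRO_BATCHES = 8      # micro-batches pipelined per training iteration

    def layer_cost(layer_id, worker_id, straggler=None):
        """Simulated per-layer compute time; the straggler runs 3x slower."""
        base = 0.0005 * (1 + layer_id % 3)          # layers are not equally expensive
        return base * (3.0 if worker_id == straggler else 1.0)

    def run_iteration(partition, straggler=None):
        """Accumulate per-worker busy time over one pipelined iteration."""
        busy = [0.0] * len(partition)
        for _ in range(MICRO_BATCHES):
            for w, layers in enumerate(partition):
                busy[w] += sum(layer_cost(l, w, straggler) for l in layers)
        return busy

    def rebalance(partition, busy, threshold=1.5):
        """Shift one boundary layer off the slowest stage, but only when that
        stage is clearly overloaded relative to the average."""
        slow = busy.index(max(busy))
        if busy[slow] <= threshold * (sum(busy) / len(busy)) or len(partition[slow]) <= 1:
            return partition                        # balanced enough; keep the partition
        neighbours = [w for w in (slow - 1, slow + 1) if 0 <= w < len(partition)]
        target = min(neighbours, key=lambda w: busy[w])
        if target < slow:                           # hand the first layer to the left neighbour
            partition[target].append(partition[slow].pop(0))
        else:                                       # hand the last layer to the right neighbour
            partition[target].insert(0, partition[slow].pop())
        return partition

    # Start from an even partition: worker w holds NUM_LAYERS // NUM_WORKERS layers.
    per_worker = NUM_LAYERS // NUM_WORKERS
    partition = [list(range(w * per_worker, (w + 1) * per_worker)) for w in range(NUM_WORKERS)]
    straggler = 2                                   # worker 2 is an artificial straggler

    for it in range(5):
        busy = run_iteration(partition, straggler)
        print(f"iter {it}: partition={partition} busy={[round(b, 3) for b in busy]}")
        partition = rebalance(partition, busy)

In this toy run the partition drifts away from the simulated straggler until the per-stage busy times are roughly even, which mirrors, at a small scale, the load-balancing behaviour that dynamic workload tuning aims for; a real system would additionally have to migrate layer parameters and activations between GPUs.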






      Published In

      ScienceCloud '19: Proceedings of the 10th Workshop on Scientific Cloud Computing
      June 2019
      32 pages
      ISBN:9781450367585
      DOI:10.1145/3322795
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. GPU memory consumption
      2. communication overheads
      3. model parallelism
      4. straggler effect

      Qualifiers

      • Short-paper

      Conference

      HPDC '19

      Acceptance Rates

      ScienceCloud '19 paper acceptance rate: 22 of 106 submissions (21%)
      Overall acceptance rate: 44 of 151 submissions (29%)



      Cited By

      • (2024) Advancements in Accelerating Deep Neural Network Inference on AIoT Devices: A Survey. IEEE Transactions on Sustainable Computing 9(6), 830-847. DOI: 10.1109/TSUSC.2024.3353176. Online publication date: Nov 2024.
      • (2024) Bandwidth Characterization of DeepSpeed on Distributed Large Language Model Training. 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 241-256. DOI: 10.1109/ISPASS61541.2024.00031. Online publication date: 5 May 2024.
      • (2024) Resource- and Workload-Aware Malware Detection through Distributed Computing in IoT Networks. Proceedings of the 29th Asia and South Pacific Design Automation Conference, 368-373. DOI: 10.1109/ASP-DAC58780.2024.10473814. Online publication date: 22 Jan 2024.
      • (2024) A high-performance dataflow-centric optimization framework for deep learning inference on the edge. Journal of Systems Architecture 152, 103180. DOI: 10.1016/j.sysarc.2024.103180. Online publication date: Jul 2024.
      • (2023) Offloading Machine Learning to Programmable Data Planes: A Systematic Survey. ACM Computing Surveys 56(1), 1-34. DOI: 10.1145/3605153. Online publication date: 26 Aug 2023.
      • (2023) Resource- and Workload-Aware Model Parallelism-Inspired Novel Malware Detection for IoT Devices. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 42(12), 4618-4628. DOI: 10.1109/TCAD.2023.3290128. Online publication date: Dec 2023.
      • (2023) SmartPipe: Intelligently Freezing Layers in Pipeline Parallelism for Distributed DNN Training. 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), 1885-1894. DOI: 10.1109/ICPADS60453.2023.00259. Online publication date: 17 Dec 2023.
      • (2023) Enabling All In-Edge Deep Learning: A Literature Review. IEEE Access 11, 3431-3460. DOI: 10.1109/ACCESS.2023.3234761. Online publication date: 2023.
      • (2023) Layer-wise partitioning and merging for efficient and scalable deep learning. Future Generation Computer Systems 149, 432-444. DOI: 10.1016/j.future.2023.07.043. Online publication date: Dec 2023.
      • (2023) Xenos: Dataflow-Centric Optimization to Accelerate Model Inference on Edge Devices. Database Systems for Advanced Applications, 535-545. DOI: 10.1007/978-3-031-30637-2_35. Online publication date: 14 Apr 2023.
