DOI: 10.1145/3503222.3507735

NASPipe: high performance and reproducible pipeline parallel supernet training via causal synchronous parallelism

Published: 22 February 2022

Abstract

Supernet training, a prevalent and important paradigm in Neural Architecture Search, embeds the whole DNN architecture search space into one monolithic supernet, iteratively activates a subset of the supernet (i.e., a subnet) for fitting each batch of data, and searches for a high-quality subnet that meets specific requirements. Although training subnets in parallel on multiple GPUs is desirable for acceleration, there inherently exists a race hazard: concurrent subnets may access the same DNN layers. Existing systems can neither efficiently parallelize subnets’ training executions nor resolve the race hazard deterministically, leading to unreproducible training procedures and potentially non-trivial accuracy loss.
We present NASPipe, the first high-performance and reproducible distributed supernet training system, built on a causal synchronous parallel (CSP) pipeline scheduling abstraction: NASPipe partitions a supernet across GPUs and concurrently executes multiple generated sub-tasks (subnets) in a pipelined manner; meanwhile, it oversees the correlations between the subnets and deterministically resolves any causal dependency caused by subnets’ layer sharing. To obtain high performance, NASPipe’s CSP scheduler exploits the fact that the larger the space a supernet spans, the fewer dependencies manifest between chronologically close subnets; therefore, it aggressively schedules subnets with larger chronological orders into execution, but only if they are not causally dependent on unfinished precedent subnets. Moreover, to relieve the excessive GPU memory burden of holding the whole supernet’s parameters, NASPipe uses a context switch technique that stashes the whole supernet in CPU memory, precisely predicts the subnets’ schedule, and pre-fetches/evicts a subnet before/after its execution. The evaluation shows that NASPipe is the only system that retains supernet training reproducibility while achieving comparable or even higher performance (up to 7.8X) than three recent pipeline training systems (e.g., GPipe).
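To make the CSP scheduling rule concrete, the following Python sketch models only the dispatch decision described above; it is not NASPipe’s implementation. It assumes each subnet is represented by the set of layer IDs it activates, the pipeline holds a fixed number of subnets in flight, and in-flight subnets retire in chronological order; the names csp_schedule and causally_ready are hypothetical.

# Minimal sketch of the causal synchronous parallel (CSP) dispatch rule:
# a subnet with chronological order t may start only if it shares no layer
# with any unfinished subnet of smaller order.

def causally_ready(layers, order, unfinished):
    """True if `order` has no causal dependency on an unfinished precedent subnet."""
    return all(prev >= order or not (prev_layers & layers)
               for prev, prev_layers in unfinished.items())

def csp_schedule(subnets, max_in_flight):
    """Greedily fill the pipeline with the earliest causally ready subnets.

    subnets: dict mapping chronological order -> set of layer IDs it activates.
    Returns the dispatch order; for simplicity, in-flight subnets retire
    oldest-first whenever the pipeline is full or nothing else is ready.
    """
    pending = dict(sorted(subnets.items()))
    unfinished = {}                               # order -> layer set, in flight
    dispatched = []
    while pending or unfinished:
        if len(unfinished) == max_in_flight or not pending:
            unfinished.pop(min(unfinished))       # retire the oldest subnet
            continue
        for order in list(pending):               # scan in chronological order
            if causally_ready(pending[order], order, unfinished):
                unfinished[order] = pending.pop(order)
                dispatched.append(order)
                break
        else:
            unfinished.pop(min(unfinished))       # nothing ready: wait for oldest
    return dispatched

# Subnets 1 and 2 share layer "b"; subnet 3 activates disjoint layers,
# so it is dispatched ahead of subnet 2 while subnet 1 is still running.
print(csp_schedule({1: {"a", "b"}, 2: {"b", "c"}, 3: {"d"}}, max_in_flight=2))  # [1, 3, 2]

In this toy example the shared-layer dependency between subnets 1 and 2 is still resolved in chronological order, preserving reproducibility, while the independent subnet 3 is pulled forward to keep the pipeline busy, mirroring the scheduling behavior described in the abstract.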

Published In

ASPLOS '22: Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems
February 2022
1164 pages
ISBN: 9781450392051
DOI: 10.1145/3503222

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 February 2022

Author Tags

  1. Neural networks
  2. automatic machine learning
  3. distributed training
  4. neural architecture search
  5. one-shot
  6. parallel training

Qualifiers

  • Research-article

Conference

ASPLOS '22

Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%

Article Metrics

  • Downloads (Last 12 months)94
  • Downloads (Last 6 weeks)7
Reflects downloads up to 20 Feb 2025

Cited By

  • (2024) Accelerating Distributed DLRM Training with Optimized TT Decomposition and Micro-Batching. SC24: International Conference for High Performance Computing, Networking, Storage and Analysis. https://doi.org/10.1109/SC41406.2024.00055, pp. 1-15. Online publication date: 17-Nov-2024.
  • (2024) CSIMD: Cross-Search Algorithm with Improved Multi-dimensional Dichotomy for Micro-Batch-Based Pipeline Parallel Training in DNN. Euro-Par 2024: Parallel Processing. https://doi.org/10.1007/978-3-031-69766-1_20, pp. 288-301. Online publication date: 26-Aug-2024.
  • (2023) Pipe-BD: Pipelined Parallel Blockwise Distillation. 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE). https://doi.org/10.23919/DATE56975.2023.10137044, pp. 1-6. Online publication date: Apr-2023.
  • (2023) SDPipe: A Semi-Decentralized Framework for Heterogeneity-Aware Pipeline-parallel Training. Proceedings of the VLDB Endowment 16(9), 2354-2363. https://doi.org/10.14778/3598581.3598604. Online publication date: 1-May-2023.
  • (2023) Automatic Pipeline Parallelism: A Parallel Inference Framework for Deep Learning Applications in 6G Mobile Communication Systems. IEEE Journal on Selected Areas in Communications 41(7), 2041-2056. https://doi.org/10.1109/JSAC.2023.3280970. Online publication date: 1-Jul-2023.
  • (2023) Efficient Supernet Training Using Path Parallelism. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). https://doi.org/10.1109/HPCA56546.2023.10071099, pp. 1249-1261. Online publication date: Feb-2023.
