
Towards optimal placement and scheduling of DNN operations with Pesto

Published: 02 October 2021

Abstract

The increasing size of Deep Neural Networks (DNNs) has necessitated the use of multiple GPUs to host a single DNN model, a practice commonly referred to as model parallelism. The key challenge for model parallelism is to efficiently and effectively partition the DNN model across GPUs to avoid communication overheads while maximizing GPU utilization, with the end goal of minimizing the training time of DNN models. Existing approaches either take a long time (hours or even days) to find an effective partition or settle for sub-optimal partitioning, invariably increasing the end-to-end training effort. In this paper, we design and implement Pesto, a fast and near-optimal model placement technique for automatically partitioning arbitrary DNNs across multiple GPUs. The key idea in Pesto is to jointly optimize the model placement and scheduling at the fine-grained operation level to minimize inter-GPU communication while maximizing the opportunity to parallelize the model across GPUs. By carefully formulating the problem as an integer program, Pesto can provide the optimal placement and scheduling. We implement Pesto in TensorFlow and show that Pesto can reduce model training time by up to 31% compared to state-of-the-art approaches, across several large DNN models.
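
To make the joint placement-and-scheduling objective concrete, below is a minimal, hypothetical sketch; it is not Pesto's algorithm (Pesto formulates the problem as an integer program) but a brute-force illustration of what that objective measures. A toy five-operation graph is placed on two GPUs, and each candidate placement is scored with a simple list schedule that charges a communication delay whenever a tensor crosses GPUs. All operation names, compute times, and communication costs are invented for illustration.

```python
# Illustrative sketch only: exhaustively search operation placements for a toy
# DNN graph on two GPUs, scoring each placement with a greedy list schedule.
# Pesto instead solves this joint placement + scheduling problem as an integer
# program; the graph and all timing numbers below are hypothetical.
from itertools import product

# Toy operation graph: op -> (compute time in ms, list of predecessor ops)
ops = {
    "embed":  (2.0, []),
    "lstm_f": (4.0, ["embed"]),
    "lstm_b": (4.0, ["embed"]),
    "attn":   (3.0, ["lstm_f", "lstm_b"]),
    "proj":   (2.0, ["attn"]),
}
# Communication delay (ms) charged when producer and consumer are on different GPUs.
comm = {("embed", "lstm_f"): 1.0, ("embed", "lstm_b"): 1.0,
        ("lstm_f", "attn"): 1.5, ("lstm_b", "attn"): 1.5, ("attn", "proj"): 0.5}

def makespan(placement, num_gpus=2):
    """Greedy list schedule in topological order; returns the total finish time."""
    gpu_free = [0.0] * num_gpus   # when each GPU next becomes idle
    finish = {}                   # finish time of each scheduled op
    remaining = dict(ops)
    while remaining:
        # pick any op whose predecessors have all been scheduled (graph is a DAG)
        name = next(o for o, (_, preds) in remaining.items()
                    if all(p in finish for p in preds))
        time, preds = remaining.pop(name)
        g = placement[name]
        # an op starts once its GPU is free and all inputs have arrived,
        # paying a transfer delay for inputs produced on the other GPU
        ready = max([gpu_free[g]] +
                    [finish[p] + (comm[(p, name)] if placement[p] != g else 0.0)
                     for p in preds])
        finish[name] = ready + time
        gpu_free[g] = finish[name]
    return max(finish.values())

# Try every placement of the 5 ops onto 2 GPUs (2^5 candidates) and keep the best.
names = list(ops)
best = min((dict(zip(names, assign))
            for assign in product(range(2), repeat=len(names))),
           key=makespan)
print("best placement:", best, "makespan:", makespan(best))
```

In this toy instance the best placement splits the two independent LSTM branches across GPUs despite the transfer cost, which is exactly the communication-versus-parallelism trade-off described above; Pesto replaces the exhaustive search with an integer-programming formulation so that placement and scheduling are optimized jointly on real DNN graphs.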

Published In

Middleware '21: Proceedings of the 22nd International Middleware Conference
December 2021, 398 pages
ISBN: 9781450385343
DOI: 10.1145/3464298
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • USENIX Assoc
  • IFIP

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 October 2021

Author Tags

  1. DNN placement
  2. giant DNNs
  3. model parallelism
  4. scheduling
  5. systems for ML

Qualifiers

  • Research-article

Conference

Middleware '21: 22nd International Middleware Conference
December 6-10, 2021
Québec City, Canada

Acceptance Rates

Overall Acceptance Rate 203 of 948 submissions, 21%

Cited By

  • (2024) nnScaler. Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation, 347-363. DOI: 10.5555/3691938.3691957. Online publication date: 10-Jul-2024.
  • (2023) Metrics for Sustainability in Data Centers. ACM SIGEnergy Energy Informatics Review 3(3), 40-46. DOI: 10.1145/3630614.3630622. Online publication date: 25-Oct-2023.
  • (2023) Mercury: Fast and Optimal Device Placement for Large Deep Learning Models. Proceedings of the 52nd International Conference on Parallel Processing, 412-422. DOI: 10.1145/3605573.3605603. Online publication date: 7-Aug-2023.
  • (2023) A Survey on Auto-Parallelism of Large-Scale Deep Learning Training. IEEE Transactions on Parallel and Distributed Systems 34(8), 2377-2390. DOI: 10.1109/TPDS.2023.3281931. Online publication date: 1-Aug-2023.
  • (2023) Expediting Distributed DNN Training With Device Topology-Aware Graph Deployment. IEEE Transactions on Parallel and Distributed Systems 34(4), 1281-1293. DOI: 10.1109/TPDS.2023.3243261. Online publication date: 1-Apr-2023.
  • (2022) FuncPipe: A Pipelined Serverless Framework for Fast and Cost-Efficient Training of Deep Learning Models. Proceedings of the ACM on Measurement and Analysis of Computing Systems 6(3), 1-30. DOI: 10.1145/3570607. Online publication date: 8-Dec-2022.
  • (2022) Serving unseen deep learning models with near-optimal configurations. Proceedings of the 13th Symposium on Cloud Computing, 461-476. DOI: 10.1145/3542929.3563485. Online publication date: 7-Nov-2022.
