ABSTRACT
Modern GPU clusters are inherently heterogeneous in both computation and communication. This heterogeneity poses a significant challenge for the elastic scheduling of deep learning (DL) workloads. Unfortunately, existing elastic schedulers often overlook its impact on scaling efficiency, which considerably prolongs job completion times.
In this paper, we present Heet, a heterogeneity-aware system explicitly designed for elastic training in DL clusters. Heet addresses two critical issues. First, it uses a 3-D collaborative filtering method to accurately measure the scaling efficiency of all elastic configurations on heterogeneous hosts, substantially reducing profiling overhead. Second, it introduces a novel price function to balance scaling efficiency against scheduling efficiency. Building on this function, Heet employs a scalable mechanism that combines minimum-weight full bipartite matching with opportunistic resource trading to generate dynamic scheduling decisions. Evaluations on cloud clusters and large-scale simulations show that Heet reduces job completion time by up to 2.46× compared with existing solutions.
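The abstract does not give implementation details for the 3-D collaborative filtering step. As a rough, hypothetical sketch of the general idea (predicting unprofiled scaling-efficiency entries from a sparse set of measurements), the following factorizes a 2-D job-by-host slice in the Funk-SVD style; all numbers, names, and hyperparameters here are invented for illustration, and the paper's actual method operates over three dimensions (job, host, and elastic configuration):

```python
import random

# Toy sparse "scaling efficiency" table: observed[(job, host)] = measured
# efficiency in (0, 1]. Entries are made up for illustration.
observed = {
    (0, 0): 0.9, (0, 1): 0.6,
    (1, 0): 0.8, (1, 2): 0.5,
    (2, 1): 0.7, (2, 2): 0.4,
}

def factorize(obs, n_rows, n_cols, rank=2, lr=0.05, reg=0.01,
              epochs=500, seed=0):
    """Fit low-rank factors P, Q so that P[i] . Q[j] approximates obs[(i, j)].

    Plain SGD with L2 regularization over the observed cells only, in the
    Funk-SVD style; missing cells are then predicted as P[i] . Q[j].
    """
    rng = random.Random(seed)
    P = [[rng.uniform(0.0, 0.5) for _ in range(rank)] for _ in range(n_rows)]
    Q = [[rng.uniform(0.0, 0.5) for _ in range(rank)] for _ in range(n_cols)]
    for _ in range(epochs):
        for (i, j), y in obs.items():
            pred = sum(P[i][k] * Q[j][k] for k in range(rank))
            err = y - pred
            for k in range(rank):
                p, q = P[i][k], Q[j][k]
                P[i][k] += lr * (err * q - reg * p)
                Q[j][k] += lr * (err * p - reg * q)
    return P, Q

P, Q = factorize(observed, n_rows=3, n_cols=3)
# Predict an unprofiled (job, host) pair, e.g. job 1 on host 1:
prediction = sum(P[1][k] * Q[1][k] for k in range(2))
```

The payoff is that only a subset of (job, host, configuration) combinations ever needs to be profiled; the rest are filled in by the model, which is what cuts the profiling overhead.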
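Likewise, the scheduling step can be pictured as a minimum-weight full bipartite matching between jobs and hosts, where the edge weight is the price-function value of placing a job on a host. The sketch below is only illustrative: the cost matrix is invented, and a brute-force search stands in for whatever polynomial-time matching algorithm (e.g. the Hungarian method) a real scheduler would use:

```python
from itertools import permutations

# Hypothetical cost matrix: COST[j][h] = price of running job j on host
# type h (lower = better). The numbers are made up for illustration.
COST = [
    [4.0, 1.0, 3.0],
    [2.0, 0.5, 5.0],
    [3.0, 2.0, 2.5],
]

def min_weight_matching(cost):
    """Brute-force minimum-weight full bipartite matching.

    Tries every one-to-one assignment of jobs to hosts and keeps the
    cheapest. Fine for a 3x3 toy; a production scheduler would use a
    polynomial-time algorithm instead.
    """
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[j][perm[j]] for j in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return best_perm, best_cost

assignment, total = min_weight_matching(COST)
# assignment[j] is the host chosen for job j; total is the matching weight.
```

Because the matching is "full" (every job is paired with a host), leftover capacity can then be redistributed; the opportunistic resource trading the abstract mentions would act on top of an assignment like this.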
Heet: Accelerating Elastic Training in Heterogeneous Deep Learning Clusters