ABSTRACT
Over the last few years, ByteDance's compute infrastructure has expanded significantly to keep up with rapid business growth. To meet this hyper-scale growth, some business groups resorted to managing their own compute infrastructure stacks running different scheduling systems such as Kubernetes and YARN, which created two major pain points: increasing resource fragmentation across business groups and inadequate resource elasticity between workloads of different business priorities. This isolation across business groups (and their compute infrastructure management) leads to inefficient compute resource utilization and prevents us from serving business growth needs in the long run.
To address these challenges, we propose Gödel, a resource management and scheduling system that provides a unified compute infrastructure for all business groups to run their diverse workloads in a single resource pool. It co-locates various workloads on every machine to achieve better resource utilization and elasticity. Gödel is built upon Kubernetes, the de facto open-source container orchestration system, but with significant components replaced or enhanced to accommodate various workloads at large scale. In production, it manages clusters with tens of thousands of machines and achieves overall resource utilization of over 60% and scheduling throughput of up to 5000 pods per second. This paper reports on the design and implementation of Gödel and discusses the lessons and best practices we learned in developing and operating it in production at ByteDance's scale.
Index Terms
- Gödel: Unified Large-Scale Resource Management and Scheduling at ByteDance