DOI: 10.1145/3472883.3486971

George: Learning to Place Long-Lived Containers in Large Clusters with Operation Constraints

Published: 01 November 2021

Abstract

Online cloud services are widely deployed as Long-Running Applications (LRAs) hosted in containers. Placing LRA containers is particularly challenging because of the complex interference between co-located containers and the operation constraints in production clusters, such as fault tolerance, disaster avoidance, and incremental deployment. Existing schedulers typically provide APIs for operators to manually specify container scheduling requirements and offer only qualitative guidelines for container placement. Such schedulers do not perform well in terms of either performance or scale, and they also require manual intervention.
In this work, we propose George, an end-to-end, general-purpose LRA scheduler that leverages state-of-the-art Reinforcement Learning (RL) techniques to intelligently schedule LRA containers. We present, for the first time, an optimal container placement formulation whose objective is to maximize container placement performance subject to a set of operation constraints. One fundamental challenge in scheduling is to satisfy the different categories of operation constraints in practice: specifically, to guarantee hard constraints and to keep soft-constraint violations within a pre-defined threshold. We design a novel projection-based proximal policy optimization (PPPO) algorithm, combined with an integer linear optimization technique, to schedule LRA containers under operation constraints. To reduce training time, we apply transfer learning, taking advantage of the similarity between different LRA scheduling events. We prove theoretically that the proposed algorithm is effective, stable, and safe. We implement George as a plug-in service in Docker Swarm. Experiments on our in-house cluster demonstrate that George can maximize LRA performance while enforcing the hard constraints and keeping soft-constraint violations within a pre-defined threshold. The experiments show that George substantially improves LRA performance and scale: it requires less than 1 hour of scheduling time in a large cluster with 2K containers and 700 machines, 16x faster than existing schedulers. Compared with state-of-the-art alternatives, George also achieves 26% higher container performance with up to 70% lower constraint violation.
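To make the distinction between hard and soft constraints concrete, the following is a minimal sketch of a toy placement problem. It is not George's PPPO algorithm or its integer linear formulation: an exhaustive search is swapped in purely for illustration, and every name here (`best_placement`, `penalty`, `soft_pairs`, `max_violations`, the container names, and the numbers) is a hypothetical assumption rather than something taken from the paper. The score is the negated interference penalty of co-located pairs, machine capacity is treated as a hard constraint, and anti-affinity pairs are treated as soft constraints whose violation count must stay within a threshold.

```python
from itertools import combinations, product


def best_placement(demands, n_machines, capacity, penalty, soft_pairs, max_violations):
    """Exhaustively search placements of containers onto machines.

    Hypothetical toy model (not the paper's method):
    - hard constraint: per-machine load must not exceed `capacity`
    - soft constraint: co-locating a pair in `soft_pairs` counts as one
      violation; at most `max_violations` are tolerated
    - objective: maximize the negated interference penalty of co-located pairs
    """
    names = list(demands)
    best_score, best_assign = float("-inf"), None
    for choice in product(range(n_machines), repeat=len(names)):
        assign = dict(zip(names, choice))

        # Hard constraint: reject any placement that overloads a machine.
        load = [0] * n_machines
        for name, machine in assign.items():
            load[machine] += demands[name]
        if any(used > capacity for used in load):
            continue

        # Pairs of containers that ended up on the same machine.
        colocated = [
            frozenset(pair)
            for pair in combinations(names, 2)
            if assign[pair[0]] == assign[pair[1]]
        ]

        # Soft constraint: tolerate violations only up to the threshold.
        violations = sum(1 for pair in colocated if pair in soft_pairs)
        if violations > max_violations:
            continue

        score = -sum(penalty.get(pair, 0.0) for pair in colocated)
        if score > best_score:
            best_score, best_assign = score, assign
    return best_score, best_assign


# Illustrative inputs only: three containers, two machines with room for two each.
demands = {"web": 2, "cache": 2, "db": 2}
penalty = {
    frozenset({"web", "cache"}): 1.0,
    frozenset({"web", "db"}): 3.0,
    frozenset({"cache", "db"}): 2.0,
}
soft_pairs = {frozenset({"web", "cache"})}

strict_score, strict_assign = best_placement(demands, 2, 4, penalty, soft_pairs, 0)
relaxed_score, relaxed_assign = best_placement(demands, 2, 4, penalty, soft_pairs, 1)
print(strict_score, strict_assign)
print(relaxed_score, relaxed_assign)
```

With the threshold at zero, the search has to avoid the cheapest co-location because it violates the soft constraint; allowing one violation lets it pick that pair and reach a better score. Exhaustive search scales exponentially in the number of containers, which is one reason a learned policy combined with an optimization step is attractive at cluster scale.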


    Published In

    SoCC '21: Proceedings of the ACM Symposium on Cloud Computing
    November 2021
    685 pages
    ISBN:9781450386388
    DOI:10.1145/3472883

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Cloud Computing
    2. Container
    3. Reinforcement Learning
    4. Resource Scheduling

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • RGC General Research Fund (GRF)
    • RGC Research Impact Fund (RIF)

    Conference

    SoCC '21: ACM Symposium on Cloud Computing
    November 1 - 4, 2021
    Seattle, WA, USA

    Acceptance Rates

    Overall Acceptance Rate 169 of 722 submissions, 23%
