George: Learning to Place Long-Lived Containers in Large Clusters with Operation Constraints

Authors: Suyi Li, Luping Wang, Wei Wang, Yinghao Yu, Bo Li

SoCC '21: Proceedings of the ACM Symposium on Cloud Computing

Pages 258 - 272

https://doi.org/10.1145/3472883.3486971

Published: 01 November 2021

Abstract

Online cloud services are widely deployed as Long-Running Applications (LRAs) hosted in containers. Placing LRA containers is particularly challenging due to the complex interference between co-located containers and the operation constraints in production clusters, such as fault tolerance, disaster avoidance, and incremental deployment. Existing schedulers typically provide APIs for operators to manually specify container scheduling requirements and offer only qualitative scheduling guidelines for container placement. Such schedulers do not perform well in terms of either performance or scale, and they also require manual intervention.

In this work, we propose George, an end-to-end general-purpose LRA scheduler that leverages state-of-the-art Reinforcement Learning (RL) techniques to intelligently schedule LRA containers. We present, for the first time, an optimal container placement formulation with the objective of maximizing container placement performance subject to a set of operation constraints. One fundamental challenge in scheduling is to categorically satisfy different operation constraints in practice; specifically, to guarantee hard constraints and to keep soft constraint violations within a pre-defined threshold. We design a novel projection-based proximal policy optimization (PPPO) algorithm in combination with an Integer Linear optimization technique to intelligently schedule LRA containers under operation constraints. To reduce the training time, we apply a transfer learning technique that takes advantage of the similarity between different LRA scheduling events. We prove theoretically that our proposed algorithm is effective, stable, and safe. We implement George as a plug-in service in Docker Swarm. Experiments on our in-house cluster demonstrate that George can maximize LRA performance while enforcing the hard constraints and keeping soft constraint violations within a pre-defined threshold. The experiments show that George substantially improves LRA performance and scale, requiring less than 1 hour of scheduling time in a large cluster with 2K containers and 700 machines, 16x faster than existing schedulers. Compared with state-of-the-art alternatives, George also achieves 26% higher container performance with up to 70% lower constraint violation.
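To make the distinction between hard and soft constraints concrete, the following is a minimal sketch of a feasibility check for a candidate placement. It is not the paper's formulation or the PPPO algorithm. The placement representation, the `spread` and `colocate` rules, and the function `is_feasible` are all hypothetical names chosen for illustration. The sketch only assumes what the abstract states: every hard constraint must hold, and the share of violated soft constraints must stay within a pre-defined threshold.

```python
from typing import Callable, Dict, List

# Hypothetical representation: container name -> machine name.
Placement = Dict[str, str]
# A constraint is a predicate that returns True when the placement satisfies it.
Constraint = Callable[[Placement], bool]


def is_feasible(placement: Placement,
                hard: List[Constraint],
                soft: List[Constraint],
                soft_threshold: float) -> bool:
    """Accept a placement only if all hard constraints hold and the
    fraction of violated soft constraints is at most soft_threshold."""
    if not all(check(placement) for check in hard):
        return False
    if not soft:
        return True
    violated = sum(1 for check in soft if not check(placement))
    return violated / len(soft) <= soft_threshold


def spread(containers: List[str]) -> Constraint:
    """Illustrative rule: the listed containers must land on distinct
    machines, in the spirit of a fault-tolerance requirement."""
    def check(placement: Placement) -> bool:
        machines = [placement[c] for c in containers]
        return len(set(machines)) == len(machines)
    return check


def colocate(a: str, b: str) -> Constraint:
    """Illustrative rule: two containers should share a machine."""
    def check(placement: Placement) -> bool:
        return placement[a] == placement[b]
    return check


# Example: two web replicas must be spread (hard); each would prefer
# to sit next to the cache (soft).
placement = {"web-1": "m1", "web-2": "m2", "cache": "m1"}
hard_rules = [spread(["web-1", "web-2"])]
soft_rules = [colocate("web-1", "cache"), colocate("web-2", "cache")]

ok_loose = is_feasible(placement, hard_rules, soft_rules, soft_threshold=0.5)
ok_strict = is_feasible(placement, hard_rules, soft_rules, soft_threshold=0.25)
```

In this example one of the two soft rules is violated, so the placement passes with a threshold of 0.5 and fails with 0.25. A check of this kind only filters candidates; a scheduler still needs a search or learning procedure to find placements that also maximize performance.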

Supplementary Material

li.zip (380.32 KB): Supplemental movie, appendix, image and software files for George: Learning to Place Long-Lived Containers in Large Clusters with Operation Constraints

Day2_5-4.mp4 (227.26 MB): Presentation video
