ABSTRACT
We consider the problem of balancing the load among servers in dense racks for microsecond-scale workloads. To balance the load in such settings, tens of millions of scheduling decisions have to be made per second. Achieving this throughput while providing microsecond-scale latency is extremely challenging. To address this challenge, we design a fully decentralized load-balancing framework, which allows servers to collectively balance the load in the system. We model the interactions among servers as a cooperative stochastic game. To find the game's parametric Nash equilibrium, we design and implement a decentralized algorithm based on multi-agent-learning theory. We empirically show that our proposed algorithm is adaptive and scalable while outperforming state-of-the-art alternatives. The full version of this paper is available at https://doi.org/10.1145/3570611.
Index Terms
- Malcolm: Multi-agent Learning for Cooperative Load Management at Rack Scale