DOI: 10.1145/3578338.3593550

Malcolm: Multi-agent Learning for Cooperative Load Management at Rack Scale

Published: 19 June 2023

Abstract

We consider the problem of balancing the load among servers in dense racks for microsecond-scale workloads. To balance the load in such settings, tens of millions of scheduling decisions have to be made per second. Achieving this throughput while providing microsecond-scale latency is extremely challenging. To address this challenge, we design a fully decentralized load-balancing framework, which allows servers to collectively balance the load in the system. We model the interactions among servers as a cooperative stochastic game. To find the game's parametric Nash equilibrium, we design and implement a decentralized algorithm based on multi-agent learning theory. We empirically show that our proposed algorithm is adaptive and scalable while outperforming state-of-the-art alternatives. The full paper is available at https://doi.org/10.1145/3570611.
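
To give a rough sense of where a figure like "tens of millions of scheduling decisions per second" can come from, the following back-of-envelope sketch computes an upper bound on the task rate of a fully loaded rack, assuming one scheduling decision per task. The rack size (32 servers), core count (16 per server), and mean service time (10 microseconds) are illustrative assumptions, not values taken from the abstract, and the function name is hypothetical.

```python
def decisions_per_second(servers: int, cores_per_server: int,
                         service_time_us: float) -> float:
    """Upper bound on tasks completed per second when every core is busy.

    Assumes one scheduling decision per task and a fixed mean service
    time; all inputs are illustrative, not measurements from the paper.
    """
    if service_time_us <= 0:
        raise ValueError("service_time_us must be positive")
    total_cores = servers * cores_per_server
    # Each core finishes 1_000_000 / service_time_us tasks per second.
    return total_cores * 1_000_000 / service_time_us


# Hypothetical rack: 32 servers x 16 cores, 10-microsecond tasks.
rate = decisions_per_second(servers=32, cores_per_server=16,
                            service_time_us=10.0)
print(f"{rate / 1e6:.1f} million decisions per second")
```

Under these assumed numbers the bound is about 51 million decisions per second, which is consistent in order of magnitude with the claim above; shorter tasks or denser racks push the figure higher.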


Published In

SIGMETRICS '23: Abstract Proceedings of the 2023 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems
June 2023
123 pages
ISBN: 9798400700743
DOI: 10.1145/3578338
  • ACM SIGMETRICS Performance Evaluation Review, Volume 51, Issue 1
    SIGMETRICS '23
    June 2023
    108 pages
    ISSN: 0163-5999
    DOI: 10.1145/3606376
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cooperative game theory
  2. distributed load balancing
  3. heterogeneous systems
  4. multi-agent reinforcement learning
  5. task scheduling

Qualifiers

  • Abstract

Funding Sources

  • CFI
  • NSERC
  • ORF

Conference

SIGMETRICS '23

Acceptance Rates

Overall Acceptance Rate 459 of 2,691 submissions, 17%
