DOI: 10.1145/3578338.3593550

Malcolm: Multi-agent Learning for Cooperative Load Management at Rack Scale

Published: 19 June 2023

ABSTRACT

We consider the problem of balancing load among servers in dense racks running microsecond-scale workloads. In such settings, tens of millions of scheduling decisions must be made per second, and achieving this throughput while providing microsecond-scale latency is extremely challenging. To address this challenge, we design a fully decentralized load-balancing framework that allows servers to collectively balance the load in the system. We model the interactions among servers as a cooperative stochastic game. To find the game's parametric Nash equilibrium, we design and implement a decentralized algorithm based on multi-agent learning theory. We empirically show that the proposed algorithm is adaptive and scalable while outperforming state-of-the-art alternatives. The full paper for this abstract is available at https://doi.org/10.1145/3570611.
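The abstract does not spell out the algorithm, so the following Python sketch is only a rough illustration of what "fully decentralized" load balancing can look like. It is not Malcolm's learning-based method: the ring topology, the function name `balance_step`, the one-task-per-round migration rule, and the `threshold` parameter are all assumptions made for this example. Each server looks only at its own queue length and that of one neighbour, with no central scheduler involved.

```python
def balance_step(loads, threshold=1):
    """One synchronous round of local decisions on a ring of servers.

    Hypothetical toy rule (not the paper's algorithm): for every pair of
    ring neighbours, the busier server hands one task to the other when
    the gap in queue length exceeds `threshold`.
    """
    n = len(loads)
    new_loads = list(loads)
    for i in range(n):
        right = (i + 1) % n
        # Decisions use the loads observed at the start of the round.
        diff = loads[i] - loads[right]
        if diff > threshold:
            new_loads[i] -= 1
            new_loads[right] += 1
        elif -diff > threshold:
            new_loads[right] -= 1
            new_loads[i] += 1
    return new_loads


def run_rounds(loads, rounds, threshold=1):
    """Apply `balance_step` a fixed number of times."""
    for _ in range(rounds):
        loads = balance_step(loads, threshold)
    return loads


if __name__ == "__main__":
    initial = [12, 0, 3, 1]  # queue lengths of four servers (made-up data)
    final = run_rounds(initial, rounds=50)
    print("initial:", initial, "spread:", max(initial) - min(initial))
    print("final:  ", final, "spread:", max(final) - min(final))
```

In this toy, total work is conserved and the gap between the busiest and idlest server never grows, but the migration rule is fixed by hand. The framework described in the abstract instead learns the servers' policies by seeking a parametric Nash equilibrium of a cooperative stochastic game.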

Published in

SIGMETRICS '23: Abstract Proceedings of the 2023 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems
June 2023, 123 pages
ISBN: 9798400700743
DOI: 10.1145/3578338

ACM SIGMETRICS Performance Evaluation Review, Volume 51, Issue 1 (SIGMETRICS '23)
June 2023, 108 pages
ISSN: 0163-5999
DOI: 10.1145/3606376

Copyright © 2023 Owner/Author

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 459 of 2,691 submissions (17%)