research-article

Open access

Efficient Microsecond-scale Blind Scheduling with Tiny Quanta

Authors: Emmanuel Amaro, Amy Ousterhout, Sylvia Ratnasamy, Scott Shenker

ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2

Pages 305 - 319

https://doi.org/10.1145/3620665.3640381

Published: 27 April 2024

Abstract

A longstanding performance challenge in datacenter-based applications is how to efficiently handle incoming client requests that spawn many very short (μs scale) jobs that must be handled with high throughput and low tail latency. When no assumptions are made about the duration of individual jobs, or even about the distribution of their durations, this requires blind scheduling with frequent and efficient preemption, which is not scalably supported for μs-level tasks. We present Tiny Quanta (TQ), a system that enables efficient blind scheduling of μs-level workloads. TQ performs fine-grained preemptive scheduling and does so with high performance via a novel combination of two mechanisms: forced multitasking and two-level scheduling. Evaluations with a wide variety of μs-level workloads show that TQ achieves low tail latency while sustaining 1.2x to 6.8x the throughput of prior blind scheduling systems.
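
To make the motivation concrete, here is a minimal toy model of why blind scheduling relies on frequent preemption. It is not TQ's implementation and does not model forced multitasking or two-level scheduling; it only compares run-to-completion (FCFS) with round-robin over small quanta on a single core. The function names, the workload, the quantum, and the per-preemption cost are all assumptions chosen for illustration, and times are integer nanoseconds to keep the arithmetic exact.

```python
from collections import deque


def fcfs(service_times):
    """Run each job to completion in arrival order; return finish times."""
    finish, clock = [], 0
    for service in service_times:
        clock += service
        finish.append(clock)
    return finish


def round_robin(service_times, quantum, switch_cost=0):
    """Blind preemptive scheduling: no job runs longer than `quantum` per turn.

    The scheduler never looks at job sizes; it only preempts on a timer.
    `switch_cost` is charged every time an unfinished job is preempted.
    """
    remaining = list(service_times)
    finish = [0] * len(remaining)
    queue = deque(range(len(remaining)))
    clock = 0
    while queue:
        job = queue.popleft()
        run = min(quantum, remaining[job])
        clock += run
        remaining[job] -= run
        if remaining[job] > 0:
            clock += switch_cost  # overhead of preempting this job
            queue.append(job)
        else:
            finish[job] = clock
    return finish


# Hypothetical workload (ns): one 100 us job arrives just ahead of four 1 us jobs.
jobs = [100_000, 1_000, 1_000, 1_000, 1_000]

print("FCFS finish times:       ", fcfs(jobs))
print("Round-robin finish times:", round_robin(jobs, quantum=2_000, switch_cost=100))
```

In this toy workload the short jobs finish within a few microseconds under round-robin instead of waiting behind the long job, while the long job finishes later because it pays the preemption cost on every quantum. That cost is the overhead a system with tiny quanta has to keep small.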

Cited By

Luo Z, Son S, Ratnasamy S, Shenker S, Gavrilovska A, Terry D (2024). Harvesting memory-bound CPU stall cycles in software with MSH. Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation, pages 57-75. Online publication date: 10-Jul-2024.
https://dl.acm.org/doi/10.5555/3691938.3691942

Index Terms

1. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language features
        Coroutines
  2. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Process management
        Scheduling
    2. Software functional properties
      1. Formal methods
        Automated static analysis

Information & Contributors

Information

Published In

ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2

April 2024

1299 pages

ISBN:9798400703850

DOI:10.1145/3620665

General Chairs: Nael Abu-Ghazaleh, Rajiv Gupta

Program Chairs: Madan Musuvathi, Dan Tsafrir

This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

In-Cooperation

SIGBED: ACM Special Interest Group on Embedded Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ASPLOS '24

ASPLOS '24: 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2

April 27 - May 1, 2024

La Jolla, CA, USA

Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%
