skip to main content
10.1145/3620665.3640381acmconferencesArticle/Chapter ViewAbstractPublication PagesasplosConference Proceedingsconference-collections
research-article
Open access

Efficient Microsecond-scale Blind Scheduling with Tiny Quanta

Published: 27 April 2024 Publication History

Abstract

A longstanding performance challenge in datacenter-based applications is how to efficiently handle incoming client requests that spawn many very short (μs scale) jobs that must be handled with high throughput and low tail latency. When no assumptions are made about the duration of individual jobs, or even about the distribution of their durations, this requires blind scheduling with frequent and efficient preemption, which is not scalably supported for μs-level tasks. We present Tiny Quanta (TQ), a system that enables efficient blind scheduling of μs-level workloads. TQ performs fine-grained preemptive scheduling and does so with high performance via a novel combination of two mechanisms: forced multitasking and two-level scheduling. Evaluations with a wide variety of μs-level workloads show that TQ achieves low tail latency while sustaining 1.2x to 6.8x the throughput of prior blind scheduling systems.

References

[1]
Haitham Akkary and Michael A Driscoll. A dynamic multithreading processor. In Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture, pages 226--236. IEEE, 1998.
[2]
Matthew Arnold and Barbara G Ryder. A framework for reducing the cost of instrumented code. In Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation, pages 168--179, 2001.
[3]
Remzi H Arpaci-Dusseau and Andrea C Arpaci-Dusseau. Operating systems: Three easy pieces. Arpaci-Dusseau Books, LLC, 2018.
[4]
Thomas Ball and James R Larus. Optimally profiling and tracing programs. ACM Transactions on Programming Languages and Systems (TOPLAS), 16(4):1319--1360, 1994.
[5]
Thomas Ball and James R Larus. Efficient path profiling. In Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29, pages 46--57. IEEE, 1996.
[6]
Luiz Barroso, Mike Marty, David Patterson, and Parthasarathy Ranganathan. Attack of the killer microseconds. Communications of the ACM, 60(4):48--54, 2017.
[7]
Luiz André Barroso, Jeffrey Dean, and Urs Holzle. Web search for a planet: The google cluster architecture. IEEE micro, 23(2):22--28, 2003.
[8]
Nilanjana Basu, Claudio Montanari, and Jakob Eriksson. Frequent background polling on a shared thread, using light-weight compiler interrupts. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, pages 1249--1263, 2021.
[9]
Adam Belay, Andrea Bittau, Ali Mashtizadeh, David Terei, David Mazières, and Christos Kozyrakis. Dune: Safe user-level access to privileged cpu features. In 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), pages 335--348, 2012.
[10]
Tom Bergan, Owen Anderson, Joseph Devietti, Luis Ceze, and Dan Grossman. Coredet: A compiler and runtime system for deterministic multithreaded execution. In Proceedings of the fifteenth International Conference on Architectural support for programming languages and operating systems, pages 53--64, 2010.
[11]
Kristof Beyls and Erik D'Hollander. Reuse distance as a metric for cache behavior. In Proceedings of the IASTED Conference on Parallel and Distributed Computing and systems, volume 14, pages 350--360. Citeseer, 2001.
[12]
Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. The parsec benchmark suite: Characterization and architectural implications. In Proceedings of the 17th international conference on Parallel architectures and compilation techniques, pages 72--81, 2008.
[13]
boegel. Mica: a pin tool for collecting microarchitecture-independent workload characteristics. https://github.com/boegel/MICA, 2023.
[14]
Boost. Performance of boost context switch. https://www.boost.org/doc/libs/1_79_0/libs/context/doc/html/context/performance.html, 2022.
[15]
Boost. Performance of boost coroutine2. https://www.boost.org/doc/libs/1_81_0/libs/coroutine2/doc/html/coroutine2/performance.html, 2022.
[16]
Sem Borst, Rudesindo Núñez-Queija, and Bert Zwart. Sojourn time asymptotics in processor-sharing queues. Queueing Systems, 53:31--51, 2006.
[17]
Sol Boucher, Anuj Kalia, David G Andersen, and Michael Kaminsky. Putting the" micro" back in microservice. In 2018 USENIX Annual Technical Conference (USENIX ATC 18), pages 645--650, 2018.
[18]
Bryan Cantrill, Michael W Shapiro, and Adam H Leventhal. Dynamic instrumentation of production systems. In USENIX Annual Technical Conference, General Track, pages 15--28, 2004.
[19]
Ana Lúcia De Moura, Noemi Rodriguez, and Roberto Ierusalimschy. Coroutines in lua. Journal of Universal Computer Science, 10(7):910--925, 2004.
[20]
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: amazon's highly available key-value store. In Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles, pages 205--220, 2007.
[21]
Henri Maxime Demoulin, Joshua Fried, Isaac Pedisich, Marios Kogias, Boon Thau Loo, Linh Thi Xuan Phan, and Irene Zhang. When idling is ideal: Optimizing tail-latency for heavy-tailed datacenter workloads with perséphone. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles, pages 621--637, 2021.
[22]
Chen Ding and Yutao Zhong. Predicting whole-program locality through reuse distance analysis. In Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation, pages 245--257, 2003.
[23]
Stephen Dolan, Servesh Muralidharan, and David Gregg. Compiler support for lightweight context switching. ACM Transactions on Architecture and Code Optimization (TACO), 9(4):1--25, 2013.
[24]
DPDK. Data plane development kit. https://www.dpdk.org/, 2022.
[25]
Kenneth J Duda and David R Cheriton. Borrowed-virtual-time (BVT) scheduling: supporting latency-sensitive threads in a general-purpose scheduler. In Proceedings of the seventeenth ACM symposium on Operating systems principles, pages 261--276, 1999.
[26]
Agner Fog. Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs. Technical University of Denmark. Copyright © 1996 -- 2022. Last updated 2022-11-04.
[27]
Joshua Fried, Zhenyuan Ruan, Amy Ousterhout, and Adam Belay. Caladan: Mitigating interference at microsecond timescales. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 281--297, 2020.
[28]
Souradip Ghosh, Michael Cuevas, Simone Campanoni, and Peter Dinda. Compiler-based timing for extremely fine-grain preemptive parallelism. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1--15. IEEE, 2020.
[29]
Varun Gupta, Mor Harchol Balter, Karl Sigman, and Ward Whitt. Analysis of join-the-shortest-queue routing for web server farms. Performance Evaluation, 64(9-12):1062--1081, 2007.
[30]
Kyle C. Hale and Peter A Dinda. Enabling hybrid parallel runtimes through kernel and virtualization support. In VEE 2016 - Proceedings of the 12th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, pages 161--175, March 2016.
[31]
Mor Harchol-Balter. Performance modeling and design of computer systems: queueing theory in action. Cambridge University Press, 2013.
[32]
Rishabh Iyer, Musa Unal, Marios Kogias, and George Candea. Achieving microsecond-scale tail latency efficiently with approximate optimal scheduling. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 466--481, 2023.
[33]
Richard C Johnson, David Pearson, and Keshav Pingali. Finding regions fast: Single entry single exit and control regions in linear time. Technical report, Cornell University, 1993.
[34]
Kostis Kaffes, Timothy Chong, Jack Tigar Humphries, Adam Belay, David Mazières, and Christos Kozyrakis. Shinjuku: Preemptive scheduling for μsecond-scale tail latency. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), pages 345--360, 2019.
[35]
Chris Lattner and Vikram Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In International Symposium on Code Generation and Optimization, 2004. CGO 2004., pages 75--86. IEEE, 2004.
[36]
Chang-Gun Lee, Hoosun Hahn, Yang-Min Seo, Sang Lyul Min, Rhan Ha, Seongsoo Hong, Chang Yun Park, Minsuk Lee, and Chong Sang Kim. Analysis of cache-related preemption delay in fixed-priority preemptive scheduling. IEEE transactions on computers, 47(6):700--713, 1998.
[37]
Chuanpeng Li, Chen Ding, and Kai Shen. Quantifying the cost of context switch. In Proceedings of the 2007 workshop on Experimental computer science, pages 2--es, 2007.
[38]
Yueying Li, Nikita Lazarev, David Koufaty, Yijun Yin, Andy Anderson, Zhiru Zhang, Edward Suh, Kostis Kaffes, and Christina Delimitrou. Towards fast, adaptive, and hardware-assisted user-space scheduling. arXiv preprint arXiv:2308.02896, 2023.
[39]
Hwa-Chun Lin and Cauligi S Raghavendra. An approximate analysis of the join the shortest queue (JSQ) policy. IEEE Transactions on Parallel and Distributed Systems, 7(3):301--307, 1996.
[40]
Fang Liu and Yan Solihin. Understanding the behavior and implications of context switch misses. ACM Transactions on Architecture and Code Optimization (TACO), 7(4):1--28, 2010.
[41]
LLVM. LLVM's analysis and transform passes. https://llvm.org/docs/Passes.html, 2022.
[42]
Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. Acm sigplan notices, 40(6):190--200, 2005.
[43]
Sohil Mehta. x86 user interrupts support. https://lwn.net/Articles/869140/, 2021.
[44]
Meta. Rocksdb. https://rocksdb.org/, 2022.
[45]
Michael Mitzenmacher. The power of two choices in randomized load balancing. IEEE Transactions on Parallel and Distributed Systems, 12(10):1094--1104, 2001.
[46]
Ana Lúcia De Moura and Roberto Ierusalimschy. Revisiting coroutines. ACM Transactions on Programming Languages and Systems (TOPLAS), 31(2):1--31, 2009.
[47]
Jorge Munoz-Gama, Josep Carmona, and Wil MP Van Der Aalst. Single-entry single-exit decomposed conformance checking. Information Systems, 46:102--122, 2014.
[48]
Hemendra Singh Negi, Tulika Mitra, and Abhik Roychoudhury. Accurate estimation of cache-related preemption delay. In Proceedings of the 1st IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis, pages 201--206, 2003.
[49]
Gor Nishanov. C++ extensions for coroutines. 2018.
[50]
Rajesh Nishtala, Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry C Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab, et al. Scaling memcache at facebook. In Presented as part of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13), pages 385--398, 2013.
[51]
Misja Nuyens and Adam Wierman. The foreground-background queue: a survey. Performance evaluation, 65(3-4):286--307, 2008.
[52]
Amy Ousterhout, Joshua Fried, Jonathan Behrens, Adam Belay, and Hari Balakrishnan. Shenango: Achieving high {CPU} efficiency for latency-sensitive datacenter workloads. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), pages 361--378, 2019.
[53]
Chandandeep Singh Pabla. Completely fair scheduler. Linux Journal, 2009(184):4, 2009.
[54]
George Prekas, Marios Kogias, and Edouard Bugnion. Zygos: Achieving low tail latency for microsecond-scale networked tasks. In Proceedings of the 26th Symposium on Operating Systems Principles, pages 325--341, 2017.
[55]
Idris A Rai, Guillaume Urvoy-Keller, and Ernst W Biersack. Analysis of las scheduling for job size distributions with high variance. In Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, pages 218--228, 2003.
[56]
Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, and Christos Kozyrakis. Evaluating mapreduce for multi-core and multiprocessor systems. In 2007 IEEE 13th International Symposium on High Performance Computer Architecture, pages 13--24. Ieee, 2007.
[57]
Boris Schäling. The boost C++ libraries. Boris Schäling, 2011.
[58]
Jori Selen, Ivo Adan, and Stella Kapodistria. Approximate performance analysis of generalized join the shortest queue routing. EAI Endorsed Transactions on Future Internet, 3(10), 1 2016.
[59]
Hamed Seyedroudbari, Srikar Vanavasam, and Alexandros Daglis. Turbo: SmartNIC-enabled dynamic load balancing of μs-scale RPCs.
[60]
John Paul Shen and Mikko H Lipasti. Modern processor design: fundamentals of superscalar processors. Waveland Press, 2013.
[61]
TPCC. Tpc-c. https://www.tpc.org/tpcc/, 2022.
[62]
Stephen Tu, Wenting Zheng, Eddie Kohler, Barbara Liskov, and Samuel Madden. Speedy transactions in multicore in-memory databases. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 18--32, 2013.
[63]
Shay Vargaftik, Isaac Keslassy, and Ariel Orda. LSQ: Load balancing in large-scale heterogeneous systems with multiple dispatchers. IEEE/ACM Transactions on Networking, 28(3):1186--1198, 2020.
[64]
Adam Wierman and Bert Zwart. Is tail-optimal scheduling possible? Operations research, 60(5):1249--1257, 2012.
[65]
Reinhard Wilhelm, Jakob Engblom, Andreas Ermedahl, Niklas Holsti, Stephan Thesing, David Whalley, Guillem Bernat, Christian Ferdinand, Reinhold Heckmann, Tulika Mitra, Frank Mueller, Isabelle Puaut, Peter Puschner, Jan Staschulat, and Per Stenstrom. The worst-case execution-time problem---overview of methods and survey of tools. ACM Transactions on Embedded Computing Systems (TECS), 7(3):1--53, 2008.
[66]
Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The splash-2 programs: Characterization and methodological considerations. ACM SIGARCH computer architecture news, 23(2):24--36, 1995.
[67]
Sergey F Yashkov. Processor-sharing queues: Some progress in analysis. Queueing systems, 2:1--17, 1987.
[68]
Irene Zhang, Amanda Raybuck, Pratyush Patel, Kirk Olynyk, Jacob Nelson, Omar S Navarro Leija, Ashlie Martinez, Jing Liu, Anna Kornfeld Simpson, Sujay Jayakar, Pedro Henrique Pennar, Max Demoulin, Piali Choudhuryr, and Anirudh Badam. The demikernel datapath os architecture for microsecond-scale datacenter systems. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles, pages 195--211, 2021.
[69]
Xingyu Zhou, Ness Shroff, and Adam Wierman. Asymptotically optimal load balancing in large-scale heterogeneous systems with multiple dispatchers. ACM SIGMETRICS Performance Evaluation Review, 48(3):57--58, 2021.
[70]
Hang Zhu, Kostis Kaffes, Zixu Chen, Zhenming Liu, Christos Kozyrakis, Ion Stoica, and Xin Jin. Racksched: A microsecond-scale scheduler for rack-scale computers. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 1225--1240, 2020.

Cited By

View all
  • (2024)Harvesting memory-bound CPU stall cycles in software with MSHProceedings of the 18th USENIX Conference on Operating Systems Design and Implementation10.5555/3691938.3691942(57-75)Online publication date: 10-Jul-2024

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2
April 2024
1299 pages
ISBN:9798400703850
DOI:10.1145/3620665
This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

In-Cooperation

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 April 2024

Check for updates

Author Tags

  1. blind scheduling
  2. microsecond scale
  3. tail latency

Qualifiers

  • Research-article

Conference

ASPLOS '24

Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%

Upcoming Conference

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)857
  • Downloads (Last 6 weeks)93
Reflects downloads up to 20 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Harvesting memory-bound CPU stall cycles in software with MSHProceedings of the 18th USENIX Conference on Operating Systems Design and Implementation10.5555/3691938.3691942(57-75)Online publication date: 10-Jul-2024

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media