CTS: An operating system CPU scheduler to mitigate tail latency for latency-sensitive multi-threaded applications

https://doi.org/10.1016/j.jpdc.2018.04.003

Highlights

  • It has been proven that FCFS scheduling of threads leads to lower tail latency.

  • Experiments show that CFS policies lead to LCFS scheduling, aggravating tail latency.

  • CTS policies ensure FCFS thread scheduling, which yields lower tail latency.

  • CTS enforces our policies while maintaining the key features of the Linux scheduler.

  • Experimental results show that CTS significantly outperforms the Linux scheduler.

Abstract

Large-scale interactive Web services break a user’s request into many sub-requests and send them to a large number of independent servers so as to consult multi-terabyte datasets instantaneously. Service responsiveness hinges on the slowest server, making the tail of the latency distribution of individual servers a matter of great concern. A large number of latency-sensitive applications hosted on individual servers use a thread-driven concurrency model wherein a thread is spawned for each user connection. Threaded applications rely on the operating system CPU scheduler to determine the order of thread execution. Our experiments show that idiosyncrasies of the default Linux scheduler (CFS) result in LCFS (Last Come First Served) scheduling of threads belonging to the same application. On the other hand, studies have shown that FCFS (First Come First Served) scheduling yields the lowest response time variability and tail latency, making the default Linux scheduler a source of long tail latency for multi-threaded applications. In this paper, we present CTS, an operating system CPU scheduler that trims the tail of the latency distribution for latency-sensitive multi-threaded applications while maintaining the key characteristics of the default Linux scheduler (e.g., fairness). By adding new data structures to the Linux kernel, CTS tracks threads belonging to an application in a timely manner and schedules them in FCFS order, mitigating the tail latency. To keep the existing features of the default Linux scheduler intact, CTS leaves CFS responsible for system-wide load balancing and core-level process scheduling; CTS merely schedules the threads of the process chosen by CFS in FCFS order, ensuring tail latency mitigation without sacrificing the default Linux scheduler’s properties. Experiments with a prototype implementation of CTS in the Linux kernel demonstrate that CTS significantly outperforms the default Linux scheduler. For example, CTS mitigates the tail latency of a Null RPC server by up to 96%, a Thrift server by up to 90%, and an Apache Web server by up to 51% at the 99.9th percentile.

Introduction

Large-scale Web applications (e.g., search engines and social networks) use parallelization to process large datasets instantaneously by breaking a user request into many sub-requests and distributing the sub-requests across a large number of individual servers. The user request, therefore, does not complete until the slowest of these sub-requests has been fulfilled. In fact, the responsiveness of individual servers dominates the quality of the delivered service: not only does the main request’s responsiveness hinge on the slowest sub-operation, but the middleware may also decide to drop replies arriving after a predefined deadline, which further degrades the quality of service. In such systems, focusing on the average latency is not sufficient. System designers concentrate on the tail of the latency distribution of individual servers, which is the key driver of user perception [[2], [15], [19]]. For example, consider a scenario wherein a user-facing system collects responses from 200 individual servers whose tail latency at the 99th percentile is one second, meaning that one request out of 100 takes more than one second to complete on each server. It can be calculated that 86.6% of the user requests will take more than one second. Therefore, under high degrees of parallelism, the poor tail latency of individual servers impacts most of the user requests. This makes the tail of the latency distribution a great challenge for developers of individual services and the main subject of intensive research that aims at eliminating tail latency contributors by proposing novel application architectures, runtime environments, operating systems and hardware support [[8], [24], [30], [32]].
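To see where the 86.6% figure comes from (our arithmetic restating the example, not the paper’s footnote): if each of the 200 servers independently exceeds one second with probability 0.01, the probability that at least one sub-request is slow is 1 − 0.99^200 ≈ 1 − 0.134 ≈ 0.866, i.e., roughly 86.6% of user requests are delayed by at least one straggling server.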

Latency-sensitive applications typically use a thread-driven or an event-driven approach. Thread-driven applications use blocking/synchronous I/O, where a newly spawned thread handles the I/O requests of each new client connection. The number of threads running on the system is hence proportional to the number of active connections. Event-driven applications, on the other hand, use asynchronous/non-blocking I/O, wherein several main threads (worker threads) handle I/O tasks by registering callbacks to be notified asynchronously. A dispatcher pulls events (i.e., I/O tasks) from a buffer of ready file descriptors (FDs) and passes them to worker threads. The number of worker threads is fixed and is typically equal to the number of available CPU cores.
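As a concrete illustration of the thread-driven model just described, the following minimal C sketch spawns one thread per accepted connection. The port number, the echo-style handler, and the omitted error handling are illustrative placeholders, not code from the paper:

```c
/* Minimal sketch of a thread-per-connection (thread-driven) server. */
#include <pthread.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

static void *handle_client(void *arg)
{
    int fd = (int)(long)arg;
    char buf[4096];
    ssize_t n = read(fd, buf, sizeof(buf));   /* blocking (synchronous) I/O */
    if (n > 0)
        write(fd, buf, (size_t)n);            /* echo stands in for real request processing */
    close(fd);
    return NULL;
}

int main(void)
{
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { 0 };
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(8080);       /* illustrative port */
    bind(srv, (struct sockaddr *)&addr, sizeof(addr));
    listen(srv, 128);

    for (;;) {
        int fd = accept(srv, NULL, NULL);
        if (fd < 0)
            continue;
        pthread_t t;
        pthread_create(&t, NULL, handle_client, (void *)(long)fd); /* one thread per connection */
        pthread_detach(t);  /* from here on, the OS CPU scheduler decides when this thread runs */
    }
}
```

With this structure, the application never chooses which connection to serve next; the order in which `handle_client` threads run is entirely up to the operating system CPU scheduler, which is why the scheduler’s queuing discipline matters for tail latency.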

In thread-driven applications, the operating system process scheduler decides the execution order of threads. In event-driven applications, an application-level dispatcher determines the serving order [[7], [26], [28], [34]]. Regardless of the concurrency model, the execution time of an I/O task is typically short. In fact, the execution times of I/O tasks are so short that the need for preemption is obviated. This leads us to conclude that non-preemptive queuing disciplines are suitable for modeling systems running latency-sensitive applications. Therefore, classical queuing disciplines such as First Come First Served (FCFS), Last Come First Served (LCFS) and Random Order of Service (ROS) are suitable policies for scheduling I/O tasks.

Many research works (such as [[9], [10]]) have shown that even though FCFS, LCFS, and ROS policies lead to the same average response time, they result in different response time variability and tail latency. Studies (such as [[6], [23]]) have shown that FCFS scheduling provides the lowest tail latency and variability compared to LCFS and ROS, making FCFS queuing the best performer in terms of the tail of the latency distribution. In event-driven applications, the application-level dispatcher can be tuned to serve I/O tasks in FCFS order when tail latency matters, whereas in thread-driven applications the operating system CPU scheduler determines the serving order.

CFS (the Completely Fair Scheduler) is currently the default scheduler of Linux. Its main objective is to share processor resources fairly among running tasks. To achieve this goal, CFS assigns each task an attribute called vRuntime to track its CPU consumption, and it uses a red–black tree to keep the ready tasks sorted by their vRuntime values. At each scheduling decision, the scheduler chooses the task with the minimum vRuntime to be executed next, ensuring fairness. Since tasks belonging to I/O-intensive applications spend less time on the CPU than their CPU-intensive counterparts, they have a higher chance of being executed first, which enhances responsiveness. Moreover, CFS sets the vRuntime of newly woken tasks to roughly the minimum vRuntime on the run queue to increase their chance of immediate execution, because a task has likely been woken up by an incoming I/O request [[11], [17], [22], [25]]. This CFS policy of promptly executing newly woken tasks leads to LCFS scheduling of the threads of a threaded application. As mentioned before, compared to FCFS, the LCFS service discipline exacerbates the tail of the latency distribution, making the current Linux CPU scheduler ill-suited to thread-driven, latency-sensitive applications with respect to the tail of the latency distribution.
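The wakeup behavior described above can be sketched roughly as follows. This is a heavily simplified illustration of the placement idea, not the kernel’s actual place_entity() code; the structures, function name, and SLEEPER_CREDIT constant are schematic assumptions:

```c
/* Schematic stand-ins for the kernel's cfs_rq and sched_entity structures. */
struct cfs_rq        { unsigned long long min_vruntime; };  /* smallest vRuntime on this run queue */
struct sched_entity  { unsigned long long vruntime;     };  /* per-task virtual runtime */

#define SLEEPER_CREDIT 1000000ULL   /* illustrative "sleeper bonus", not the kernel's value */

static void place_woken_task(struct cfs_rq *rq, struct sched_entity *se)
{
    /* Place the waking task just below the queue minimum (without moving its
     * clock backwards if it ran very recently), so it sorts near the leftmost
     * node of the red-black tree and is typically picked almost immediately. */
    unsigned long long target = rq->min_vruntime - SLEEPER_CREDIT;

    if (se->vruntime < target)
        se->vruntime = target;

    /* As the paper observes, consistently favoring freshly woken tasks means
     * that, among sibling threads woken by successive I/O events, the most
     * recently woken thread tends to run first, i.e., LCFS service. */
}
```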

In this paper, we present CTS, an operating system CPU scheduler whose main goal is to trim the tail latency of thread-driven, latency-sensitive workloads. CTS is a 3-dimensional CPU scheduler. It leverages the default scheduler of Linux to perform system-wide load balancing, distributing tasks evenly among cores (first dimension). At each core, CTS lets the default Linux scheduler choose the process whose thread will be executed next (second dimension). Once the next process is selected, the CTS thread scheduler chooses a ready thread of that process for execution (third dimension).

The main objective of the thread scheduler is to execute threads in FCFS order. To achieve this, we have added a new data structure called the Shadow Graph that performs all operations needed by CTS, including tracking the threads belonging to a process, in O(1) time complexity. At each scheduling decision, once the default Linux scheduler chooses the next process, CTS finds the first thread of that process to execute next, ensuring that sibling threads are served in FCFS order and alleviating the tail of the latency distribution for latency-sensitive, threaded workloads. Note that Linux refers to schedulable/executable entities as tasks, and in this paper we use the same terminology. However, we use threads to refer to tasks that do not have any child (forked) tasks, and processes to indicate tasks that are the parents of one or more threads.
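The Shadow Graph itself is only named in this excerpt, so the sketch below illustrates the third scheduling dimension with an assumed per-process FIFO list; all names and fields are hypothetical, not the authors’ implementation. Threads are appended when they become runnable, and once CFS has selected their parent process the oldest runnable thread is dequeued in O(1):

```c
#include <stddef.h>

struct cts_thread {
    struct cts_thread *next;   /* singly linked, oldest-runnable first */
    /* ... thread/task state would live here ... */
};

struct cts_process {
    struct cts_thread *head;   /* thread that became runnable earliest */
    struct cts_thread *tail;   /* thread that became runnable latest */
};

/* Record a thread as runnable: append at the tail in O(1). */
static void cts_enqueue(struct cts_process *p, struct cts_thread *t)
{
    t->next = NULL;
    if (p->tail)
        p->tail->next = t;
    else
        p->head = t;
    p->tail = t;
}

/* Once CFS has chosen this process, serve its oldest runnable thread: O(1), FCFS. */
static struct cts_thread *cts_pick_next(struct cts_process *p)
{
    struct cts_thread *t = p->head;
    if (t) {
        p->head = t->next;
        if (p->head == NULL)
            p->tail = NULL;
    }
    return t;
}
```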

The CTS scheduling strategy ensures that the main characteristics of the Linux scheduler, namely fairness and responsiveness, are not adversely affected. To achieve this, CTS leverages the default scheduler for process scheduling. At each scheduling decision, the process having a thread with the minimum vRuntime value is chosen as the next process. Hence, fairness is kept intact at the process level, meaning that every process gets a fair share of processing resources as before. Given that, under CTS, the threads of the process chosen by CFS run in FCFS order, and threads belonging to the same process typically perform the same job (each stays on the CPU for roughly the same amount of time, bounded by the time slice), threads still get a fair share of processor resources. Similarly, the CTS architecture keeps the responsiveness of the default Linux scheduler intact at the process level, meaning that a process’s CPU access latency is not impacted under CTS because the default Linux scheduler is still in charge of process scheduling. Furthermore, for the threads of a process, CTS enhances responsiveness by executing them in FCFS order, resulting in lower response time variability and tail latency.

In summary, we make the following contributions:

  • We present CTS, an operating system CPU scheduler whose objective is to trim the tail of the latency distribution for thread-driven workloads. To do so, CTS uses new data structures added to the Linux kernel to track the threads belonging to an application and a thread scheduler to guarantee that they execute in FCFS order, mitigating the tail latency for latency-sensitive threaded workloads.

  • We use a 3-dimensional scheduling technique to maintain the main characteristics of the default Linux scheduler, including fairness and responsiveness, while mitigating tail latency. To this end, CTS leverages CFS to perform system-wide load balancing (first dimension) and core-level process scheduling (second dimension). Finally, the CTS thread scheduler schedules the threads of the process chosen by CFS in FCFS order (third dimension). Keeping CFS fully functional as the process scheduler preserves the key properties of the default Linux scheduler.

  • We have implemented a prototype of CTS in the Linux kernel and conducted extensive experiments using both micro-benchmarks and application-level benchmarks. The results demonstrate that CTS significantly outperforms the default Linux CPU scheduler. For example, it mitigates the tail of the latency distribution for a Null RPC server by up to 96%, a Thrift server by up to 90% and the Apache Web server by up to 51% at the 99.9th percentile.

The remainder of this paper is organized as follows. Section 2 presents a background on the Linux default scheduler and the impact of different queuing disciplines on tail latency followed by the design of CTS in Section 3. Section 4 reports the results of experiments on a prototype implementation of CTS in Linux. Section 5 presents related work and Section 6 concludes the paper.

Section snippets

The default Linux scheduler

The Completely Fair Scheduler (CFS) is the current default scheduler of Linux. Its main objective is to share processor time fairly among running tasks. To do this, it assigns each task an attribute called vRuntime to track its execution time. While a task executes on a processor, its vRuntime inflates. The speed of inflation depends on the task’s priority: the higher the priority, the lower the rate of inflation. To achieve fairness, CFS tries to keep the vRuntime of all existing tasks
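To make the priority-dependent inflation concrete (our restatement of CFS’s weighted accounting, not a formula quoted from the paper): a task with load weight w that runs for Δt of wall-clock CPU time has its vRuntime advanced by roughly Δt × w₀ / w, where w₀ is the weight of a nice-0 task. Heavier (higher-priority) tasks therefore accumulate vRuntime more slowly and are rescheduled sooner, while a nice-0 task’s vRuntime advances at wall-clock rate.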

Design

In this section, we describe the CTS design including the CTS policy and its corresponding mechanism.

Evaluation

In this section, we present our evaluation of the CTS prototype implemented in the Linux kernel for different types of latency-sensitive multi-threaded applications. We evaluate the effectiveness of CTS using both micro-benchmarks and application-level benchmarks. All benchmark applications are thread-driven. We use an RPC socket server as a micro-benchmark, and a multi-threaded Thrift server and the Apache2 Web server as application-level benchmarks. Apache Thrift is a thread-driven, scalable and

Related work

The choice between thread and event concurrency models for latency-sensitive applications has been investigated in recent years [34]. The event-driven concurrency model offers high scalability, low resource utilization and better synchronization support [[7], [28]]. On the other hand, thread-driven systems provide an easier programming and debugging experience, as they delegate scheduling and resource management to the underlying system. Von Behren et al. [26] believe that the inefficiencies

Conclusion

Responsiveness is one of the main concerns of large-scale interactive Web applications. Amazon has found that every 100 ms of latency costs it 1% in sales. Google also found that an extra 0.5 s in search page generation time drops traffic by 20% [33]. Large-scale interactive services leverage parallelization to fan out sub-requests across a large number of individual servers. The main request does not complete until the slowest sub-request is fulfilled. Therefore, the tail latency of


References (34)

  • J.C. Saez et al., Towards completely fair scheduling on asymmetric single-ISA multicore processors, J. Parallel Distrib. Comput. (2017)

  • A. Belay et al., The IX operating system: Combining low latency, high throughput, and efficiency in a protected dataplane, ACM Trans. Comput. Syst. (2016)

  • J. Dean et al., The tail at scale, Commun. ACM (2013)

  • P. Delgado et al., Job-aware scheduling in Eagle: Divide and stick to your probes

  • C. Delimitrou et al., Tarcil: Reconciling scheduling speed and quality in large shared clusters

  • A.B. de Oliveira, S. Fischmeister, A. Diwan, M. Hauswirth, P.F. Sweeney, Why you should care about quantile regression, ...

  • R. Egorova, Sojourn time tails in processor-sharing systems (2009)

  • K. Elmeleegy et al., Lazy asynchronous I/O for event-driven servers

  • M.E. Haque et al., Few-to-many: Incremental parallelism for reducing tail latency in interactive services, SIGPLAN Not. (2015)

  • M. Harchol-Balter et al., Queueing Disciplines (2010)

  • R. Jain, The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling (1990)

  • M. Kim et al., Fair-share scheduling in single-ISA asymmetric multicore architecture via scaled virtual runtime and load redistribution, J. Parallel Distrib. Comput. (2017)

  • S. Kim et al., Delayed-dynamic-selective (DDS) prediction for reducing extreme tail latency in web search

  • L. Kleinrock, Queueing Systems, Volume II: Computer Applications (1976)

  • C. Kolivas, BFS Linux Process Scheduler FAQ, ...

  • J. Li et al., Tales of the tail: Hardware, OS, and application-level sources of tail latency

  • F.P. Lilkaer, Enhancing Quality of Service metrics for high fan-in Node.js applications by optimising the network stack: Leveraging IX: The Dataplane Operating System (2015)

    Esmail Asyabi is currently a researcher in the Computer Science Department at Boston University. His research focuses on designing and building system software (e.g., OS kernels, hypervisors, device drivers) to (1) mitigate the energy consumption of computer systems, ranging from mobile devices to data center servers, (2) enhance the performance of computer systems for specific applications or workload types (e.g., IO-intensive workloads) and (3) enable efficient system support for emerging computing models (e.g., cloud computing). Currently, his research focuses on providing efficient system support for virtualized clouds. He is mainly interested in distributed systems, operating systems and cloud computing. He is also a Ph.D. student in software engineering in the School of Computer Engineering of Iran University of Science and Technology, where he is a research assistant at the distributed systems research laboratory.

    Erfan Sharafzadeh received his honors B.S. degree from Iran University of Science and Technology in 2016 and is currently pursuing his Master’s degree in computer software engineering at the same institution. His major research interests include operating systems, performance evaluation and queuing systems.

    SeyedAlireza SanaeeKohroudi received his B.S. degree in computer science from Iran University of Science and Technology (IUST) in 2016 and is studying for an M.Sc. in computer science at IUST. He is currently involved in a Smart Grid project at École polytechnique fédérale de Lausanne. His research centers on computer systems and networked systems.

    Mohsen Sharifi is a professor of software engineering in the School of Computer Engineering of Iran University of Science and Technology. He directs a distributed systems research group and laboratory. He is mainly interested in the engineering of distributed systems, solutions, and applications, particularly for use in various fields of science. The development of a true distributed operating system is on top of his wish list. He received his B.Sc., M.Sc., and Ph.D. in computer science from Victoria University, Manchester, UK, in 1982, 1986, and 1990, respectively. His website is http://webpages.iust.ac.ir/msharifi.
