An experimental comparison of different real-time schedulers on multicore systems

doi:10.1016/j.jss.2012.05.048

Journal of Systems and Software

Volume 85, Issue 10, October 2012, Pages 2405-2416

https://doi.org/10.1016/j.jss.2012.05.048

Abstract

In this work, an experimental comparison between the Rate Monotonic (RM) and Earliest Deadline First (EDF) multiprocessor real-time schedulers is performed, with a focus on soft real-time systems. We generated random workloads of synthetic periodic task sets and executed them on a large multi-core machine running the Linux operating system, gathering an extensive amount of data on their performance under various real-time scheduling strategies. The comparison involves the fixed-priority scheduler for multiprocessors available in the Linux kernel (with priorities set so as to achieve RM) and our own implementation of EDF, both configured in global, partitioned and clustered mode. The impact of the various scheduling strategies on the performance of the applications, as well as the generated scheduling overheads, are compared through an extensive set of experimental results. These provide a comprehensive view of the performance achievable by the different schedulers under various workload conditions.

Highlights

► Experimental comparison among RM and EDF on multi-processors.
► Comparison made with partitioned, clustered and global policies.
► Random workloads of synthetic periodic tasks.
► Experimentation carried out on a 48-core machine with Linux.
► Overheads achieved in the various scenarios are reported.
► Global and clustered real-time algorithms prove to be a viable solution.

Introduction

Multi-processor and multi-core computing platforms are nowadays used in the vast majority of application domains, ranging from embedded systems, to personal computing, to server-side computing including GRIDs and Cloud Computing, and finally high-performance computing. In embedded systems, small multi-core platforms are considered a viable and cost-effective solution, especially for their lower power requirements as compared to a traditional single-processor system with equivalent computing capabilities. The increased level of parallelism in these systems may be conveniently exploited to run multiple real-time applications, such as those found in industrial control, aerospace or military systems; or to support soft real-time Quality of Service (QoS) oriented applications, such as those found in multimedia, gaming or virtual reality systems.

Servers and data centres are shifting towards (massively) parallel architectures with enhanced maintainability, often accompanied by a decrease in the clock frequency driven by the increasing need for “green computing” (von Weizsaecker et al., 2009).² Cloud Computing applications promise to move most of the increasing personal computing needs of users into the “cloud”. This is leading to an unprecedented need for supporting a large number of interactive and real-time applications, often involving on-the-fly media streaming, processing and transformations with demanding performance and latency requirements. These applications usually exhibit nearly periodic workload patterns which often cannot saturate the available computing power of a single (powerful) CPU. Therefore, there is a strong industrial interest in executing an ever-increasing number of applications of this type on the same system, node, physical CPU and even core, whenever possible, in order to minimise the number of needed nodes (and reduce both power consumption and costs).

In this context, a key role is played by real-time CPU scheduling algorithms for multi-processor systems, due to their potential impact on the performance experienced by the scheduled applications. These can be roughly categorised into global schedulers and partitioned schedulers. In global scheduling, the ready tasks with the highest priorities execute on the available processors at any time. This implies the need for dynamically migrating tasks among processors. On the other hand, in partitioned scheduling each task is statically allocated on one processor, according to a specific allocation algorithm, and tasks cannot migrate. Clustered schedulers reside somewhat in the middle, where the available processors are partitioned into clusters to which tasks are statically assigned, but in each cluster tasks are globally scheduled. In a multi-core system, the use of partitioned or clustered scheduling policies brings the additional problem of how to partition the tasks among cores or clusters of cores.
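The three allocation modes described above can be viewed as one scheme parameterised by the cluster size. The following is a minimal sketch in Python, not the partitioning method used in the experiments: it assumes each task is summarised by a single utilisation value, it treats a cluster of k unit-capacity CPUs as able to hold a total utilisation of at most k (an illustrative capacity bound, not a schedulability test), and it uses a simple first-fit decreasing heuristic. The names `make_clusters` and `first_fit_decreasing` are hypothetical.

```python
from typing import Dict, List, Optional


def make_clusters(num_cpus: int, cluster_size: int) -> List[List[int]]:
    """Split CPUs 0..num_cpus-1 into consecutive clusters.

    cluster_size == 1        -> partitioned scheduling (one CPU per cluster)
    cluster_size == num_cpus -> global scheduling (a single cluster)
    anything in between      -> clustered scheduling
    """
    if cluster_size < 1 or num_cpus % cluster_size != 0:
        raise ValueError("cluster_size must evenly divide num_cpus")
    return [list(range(start, start + cluster_size))
            for start in range(0, num_cpus, cluster_size)]


def first_fit_decreasing(utilisations: Dict[str, float],
                         clusters: List[List[int]]) -> Optional[Dict[str, int]]:
    """Statically assign each task to the first cluster with enough spare capacity.

    Returns a task -> cluster index map, or None if some task does not fit.
    """
    # Assumption: a cluster of k unit-capacity CPUs offers capacity k.
    spare = [float(len(cluster)) for cluster in clusters]
    assignment: Dict[str, int] = {}
    # Place the heaviest tasks first.
    for name, u in sorted(utilisations.items(), key=lambda kv: kv[1], reverse=True):
        for idx in range(len(clusters)):
            if u <= spare[idx] + 1e-9:  # small tolerance for float rounding
                spare[idx] -= u
                assignment[name] = idx
                break
        else:
            return None  # no cluster can host this task
    return assignment


tasks = {"t1": 0.6, "t2": 0.5, "t3": 0.5, "t4": 0.3}
partitioned = first_fit_decreasing(tasks, make_clusters(4, 1))
clustered = first_fit_decreasing(tasks, make_clusters(4, 2))
global_mode = first_fit_decreasing(tasks, make_clusters(4, 4))
```

With a cluster size of 1 the assignment pins every task to one CPU, while with a cluster size equal to the number of CPUs every task lands in the single cluster and the allocation problem disappears, which is the trade-off the paragraph above describes.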

An orthogonal way to categorise schedulers is according to the way task priorities are assigned. If tasks’ priorities never change during the task lifetime, we have a fixed priority scheduler, otherwise we have a dynamic priority scheduler. In this paper, we focus on the two most popular schedulers: Rate Monotonic (RM) priority assignment for fixed priority schedulers; and Earliest Deadline First (EDF) for dynamic priority schedulers.
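The difference between the two priority assignments can be illustrated with a short sketch. It is a simplified model under stated assumptions, not the kernel implementation: each job's relative deadline is assumed equal to its task's period, and the names `Job`, `rm_priorities`, `pick_rm` and `pick_edf` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Job:
    task: str
    release: float            # release time of this job
    relative_deadline: float  # assumed equal to the task period in this sketch

    @property
    def absolute_deadline(self) -> float:
        return self.release + self.relative_deadline


def rm_priorities(periods: Dict[str, float]) -> Dict[str, int]:
    """Fixed priorities: rank 0 is the highest and goes to the shortest period."""
    ordered = sorted(periods, key=lambda name: periods[name])
    return {name: rank for rank, name in enumerate(ordered)}


def pick_rm(ready: List[Job], prio: Dict[str, int]) -> Job:
    """RM: the choice depends only on the task, never on the individual job."""
    return min(ready, key=lambda job: prio[job.task])


def pick_edf(ready: List[Job]) -> Job:
    """EDF: the choice depends on each job's absolute deadline."""
    return min(ready, key=lambda job: job.absolute_deadline)


periods = {"A": 10.0, "B": 25.0}
prio = rm_priorities(periods)
ready = [Job("A", release=20.0, relative_deadline=10.0),  # absolute deadline 30
         Job("B", release=0.0, relative_deadline=25.0)]   # absolute deadline 25
```

In this example RM selects the job of task A because A has the shorter period, whereas EDF selects the job of task B because its absolute deadline is earlier. On m processors, a global scheduler would apply the same ordering and run the m highest-priority ready jobs.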

The designer of a real-time system often needs to compare different available real-time scheduling strategies, in terms of their impact on the performance of the hosted applications.

Many recent papers compare different scheduling strategies (global/partitioned, and fixed/dynamic priority) from a theoretical point of view (see Section 2 for an analysis of related work). In these works, different schedulers are compared with respect to their achievable overall utilisation, under the constraint of keeping the task set schedulable, assuming worst-case conditions for the tasks' execution, i.e., the analysis is based on the Worst-Case Execution Times (WCET). Although appropriate for hard real-time systems, in soft real-time ones such approaches are at risk of neglecting (or merely deferring to WCET analysis) many practical issues, such as the overhead of the scheduler, the increased execution times due to migrations and additional cache misses, and the variability of memory access times in Non-Uniform Memory-Access (NUMA) machines. These issues may have a great influence on the actual performance, especially as the number of cores increases.

For instance, partitioned schedulers typically present less overhead. However, in an open system where tasks may dynamically enter and leave the system, a static allocation strategy may lead to underutilised systems. On the other hand, global scheduling is more flexible as it automatically balances the load across all processors. In addition, global dynamic priority schedulers (like EDF) are known to guarantee bounded tardiness as long as the total load does not exceed the system capacity (Valente and Lipari, 2005; Devi and Anderson, 2009; Devi and Anderson, 2008). However, global schedulers typically present a higher overhead; they cause migrations, which in turn may lead to a non-negligible increase in the tasks' execution times.
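The load condition mentioned above can be written as a sum of per-task utilisations compared against the number of processors. The sketch below is only an illustration of that condition: it assumes each task is described by an execution time and a period (the pair layout and the function names are hypothetical), and it is not a hard real-time schedulability test.

```python
from fractions import Fraction
from typing import Iterable, Tuple


def total_utilisation(tasks: Iterable[Tuple[int, int]]) -> Fraction:
    """Sum of C_i / T_i over (execution_time, period) pairs, in exact arithmetic."""
    return sum((Fraction(c, t) for c, t in tasks), Fraction(0))


def load_within_capacity(tasks: Iterable[Tuple[int, int]], num_cpus: int) -> bool:
    """True when the total load does not exceed the capacity of num_cpus
    identical unit-capacity processors."""
    return total_utilisation(tasks) <= num_cpus


# Hypothetical task set: (execution time, period) in the same time unit.
task_set = [(2, 10), (15, 20), (30, 40), (5, 50)]
```

For this task set the total utilisation is 9/5, so the condition holds on two processors and fails on one.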

Therefore, it is important to quantify such overheads in order to complement the theoretical properties of a scheduler with its practical performance figures. In this way, the designer can take a more informed decision on which scheduler to select for various application workload types.

In this paper the performance of partitioned, clustered and global variants of Rate Monotonic (RM) and Earliest Deadline First (EDF) scheduling algorithms in the Linux OS are compared. The experimental comparison is conducted on the Linux OS due to its wide applicability (with various kernel-level patches) in the domain of real-time systems.

We compare our own implementation of Global EDF (G-EDF) in the Linux kernel with the fixed priority Linux scheduler (configured so as to realise RM). The goal is not to demonstrate the effectiveness of our scheduler, but rather to make a thorough performance comparison, and establish which scheduler performs better in different contexts. In order to precisely control the experiments, our methodology consists in generating sets of synthetic real-time tasks with various characteristics in terms of execution time and memory requirements and usage. The task set is then executed on a multi-core platform and the tasks' performance is measured. The focus is on the metrics typically of interest for developers and other people who investigate performance issues, and not purely on schedulability analysis. Indeed, we consider the laxity (tasks should not complete too close to their deadlines), the number of migrations and context switches (which may negatively affect performance), and the number and type of cache misses (which have a direct impact on the application execution times and performance). Since we focus on the comparison among different CPU scheduling policies, in this paper we only consider independent tasks. The hardware platform is a 48-core machine based on AMD® Opteron™ 6168 processors (4 sockets with 12 cores per processor).
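As an illustration of the laxity metric, the sketch below computes per-job laxity from a trace of completed jobs. The definition used here (absolute deadline minus finishing time, optionally normalised by the job's release-to-deadline window) is an assumption made for illustration, and the record layout and function names are hypothetical; the exact metric used in the experiments may differ.

```python
from typing import Dict, List, NamedTuple


class JobRecord(NamedTuple):
    release: float   # release time of the job
    deadline: float  # absolute deadline
    finish: float    # measured completion time


def laxity(job: JobRecord) -> float:
    """Time left between completion and the deadline; negative means a miss."""
    return job.deadline - job.finish


def normalised_laxity(job: JobRecord) -> float:
    """Laxity as a fraction of the job's release-to-deadline window."""
    return laxity(job) / (job.deadline - job.release)


def summarise(jobs: List[JobRecord]) -> Dict[str, float]:
    values = [laxity(j) for j in jobs]
    return {
        "min_laxity": min(values),
        "mean_laxity": sum(values) / len(values),
        "missed": sum(1 for v in values if v < 0),
    }


# Hypothetical trace of three jobs of one task with period 10.
trace = [JobRecord(0.0, 10.0, 4.0),
         JobRecord(10.0, 20.0, 19.0),
         JobRecord(20.0, 30.0, 31.0)]
stats = summarise(trace)
```

A small or negative minimum laxity flags jobs that finished close to, or after, their deadlines, which is the situation the metric is meant to expose in a soft real-time setting.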

Our implementation of G-EDF has been made available as open-source code. This, together with the details about the configuration of the experiments, makes it possible to reproduce and verify all the results that have been included in the paper (see Section 6). It also allows other researchers to perform independent investigations on partitioned, clustered and global EDF scheduling on Linux, as well as to develop new schedulers and concretely compare their performance with these policies. Last, but not least, it gives any developer the possibility to try these policies for their real-time applications.

The remainder of this paper is organised as follows: in Section 2, the related work is briefly recalled. In Section 3, the background concepts needed to understand the remainder of the paper are introduced, and in Section 4 the modifications performed on the Linux kernel, to make it support global EDF scheduling, are sketched out. Sections 5 and 6 describe the methodology and report the results of the experimental evaluation phase, respectively. Finally, in Section 7 conclusions are drawn and possible directions for future work are envisioned in Section 8.

Section snippets

Related work

The comparisons available in the literature between different real-time multiprocessor scheduling solutions are almost always conducted by measuring the percentage of schedulable task sets among a number of randomly-generated ones. For example, this has been done in (Baker, 2005; Bertogna and Baruah, 2010; Masrur et al., 2010). These approaches often rely on schedulability tests or simulations, and they do not involve real tasks running on a real system, thus they cannot collect such run-time

Background

In this section a few background concepts about real-time multiprocessor scheduling are introduced for a better understanding of the rest of the paper.

In this paper, a real-time system is considered as a set of n real-time tasks {τ_1, …, τ_n} to be scheduled over a set of m identical unit-capacity processors p_1, …, p_m. Each task τ_i follows the periodic task model: it activates periodically with an inter-arrival time of T_i, generating a sequence of jobs. Each job executes for at most a worst-case

Implementation of global scheduling in SCHED_DEADLINE

In the Linux kernel, scheduling decisions are implemented inside scheduling classes. Stock Linux comes with two classes, one for fairly scheduling best-effort activities (SCHED_OTHER policy) and one implementing fixed priority real-time scheduling (SCHED_FIFO or SCHED_RR policies), following the POSIX 1003.1b (IEEE, 2004) specification. Recently, a new real-time scheduler has been made available for the Linux kernel in the form of a new scheduling class. It is called SCHED_DEADLINE³

Hardware platform

Experiments have been conducted on a Dell PowerEdge R815 server equipped with 64 GB of RAM and 4 AMD® Opteron™ 6168 12-core processors (running at 1.9 GHz), for a total of 48 cores. From a NUMA viewpoint, each processor contains two 6-core NUMA nodes and is attached to two memory controllers. The memory is globally shared among all the cores, and the cache hierarchy has 3 levels: private per-core 64 kB L1D and 512 kB L2 caches, and a global 10240 kB L3 cache.

The hardware platform runs the Linux OS

Experimental results

Running all the tests took several days and yielded an extensive set of experimental data. In this section, an excerpt of such data is reported. The full data set (4.3 GB in compressed form) is available for download from: http://retis.sssup.it/people/jlelli/papers/JSS2012. Statistics come from the results of running 3 different randomly generated task sets for each configuration in terms of scheduler, allocation policy, number of tasks and their WSS.

For example, one of the 3 task

Conclusions

In this paper, an experimental comparison of various multi-processor scheduling algorithms has been performed by running synthetic workloads of real tasks on a Linux system. The performance of the various solutions has been evaluated under diverse metrics and under multiple combinations of CPU utilisation and number of tasks.

The experimental results lead to some interesting considerations. It appears clear that global and clustered algorithms are a viable solution for multi-core platforms with

Future work

There are various parameters that have not yet been considered in our practical evaluation.

In the clustered and partitioned strategies, we applied an off-line partitioning algorithm based on Linear Programming for partitioning tasks among cores (or clusters). However, in an open system, such an off-line optimisation phase might not be feasible, and one might want to consider a far simpler and quicker heuristic for this activity (e.g., first-fit, worst-fit). We would expect

Juri Lelli received a Bachelor’s degree in Computer Engineering at the University of Pisa (Italy) in 2006, and a Master’s degree in Computer Engineering at the University of Pisa (Italy) in 2010. At the moment, he is a PhD student at the ReTiS Lab, Scuola Superiore Sant'Anna in Pisa (Italy). His research area is Quality of Service control for soft real-time systems and real-time scheduling for multi-processor systems.

References (36)

U.C. Devi et al. Improved conditions for bounded tardiness under EPDF Pfair multiprocessor scheduling. Journal of Computer and Systems Sciences (2009)
G.L. Stavrinides et al. Scheduling multiple task graphs in heterogeneous distributed real-time systems by exploiting schedule holes with bin packing techniques. Simulation Modelling Practice and Theory (2011)
X. Tang et al. A stochastic scheduling algorithm for precedence constrained tasks on grid. Future Generation Computer Systems (2011)
Abeni, L., Buttazzo, G., 1998. Integrating multimedia applications in hard real-time systems. In: Proceedings of the...
B. Andersson et al. Static-priority scheduling on multiprocessors
Andersson, B., 2003. Static-Priority Scheduling on Multiprocessors. PhD Thesis, Department of Computer Engineering,...
T.P. Baker. An analysis of fixed-priority schedulability on a multiprocessor. Real-Time Systems: The International Journal of Time-Critical Computing (2006)
Baker, T.P., 2005. A comparison of global and partitioned EDF schedulability tests for multiprocessors. In: Proceeding...
A. Bastoni et al. Cache-related preemption and migration delays: empirical approximation and impact on schedulability
A. Bastoni et al. An empirical comparison of global, partitioned, and clustered multiprocessor EDF schedulers
A. Bastoni et al. Is semi-partitioned scheduling practical?
M. Bertogna et al. Tests for global EDF schedulability analysis. Journal of Systems Architecture (2010)
M. Bertogna et al. Response-time analysis for globally scheduled symmetric multiprocessor platforms
B. Brandenburg et al. On the scalability of real-time scheduling algorithms on multicore platforms: a case study
B.B. Brandenburg et al. On the implementation of global real-time schedulers
G. Buttazzo. Rate monotonic vs. EDF: judgment day. Real-Time Systems (2005)
G. Buttazzo et al. Soft Real-Time Systems Predictability vs. Efficiency, Number 10.1007/0-387-28147-9-3 in Series in Computer Science (2005)
J.M. Calandrino et al. LITMUS-RT: a testbed for empirically comparing real-time multiprocessor schedulers

Dario Faggioli received a PhD degree in Computer Engineering from the Scuola Superiore Sant’Anna of Pisa (Italy) in 2012. His research interests were mainly in the area of “open systems”, i.e., systems where hard, soft and non real-time activities co-exist. In particular, he focused on QoS guarantee provisions to soft real-time applications. He is currently employed by Citrix and he is working on the Xen Open Source hypervisor.

Tommaso Cucinotta graduated in Computer Engineering at the University of Pisa (Italy) in 2000, and received the PhD degree in Computer Engineering from the Scuola Superiore Sant'Anna of Pisa in 2004. He has been Assistant Professor of Computer Engineering at the Real-Time Systems Laboratory (ReTiS) of Scuola Superiore Sant’Anna, with research interests mainly in the areas of real-time and embedded systems, with a particular focus on real-time support for general-purpose Operating Systems, and security, with a particular focus on smart-card based authentication. Since January 2012, he has been a researcher at Bell Laboratories, Alcatel Lucent in Dublin (Ireland).

Giuseppe Lipari is Associate Professor of Computer Engineering (scientific sector ING-INF/05) at Scuola Superiore Sant’Anna. He is part of the RETIS lab of the TeCIP (Institute of Communication, Information and Perception Engineering). Since April 2012, he has been on leave to spend two years at the Laboratoire de Spécification et Vérification, École Normale Supérieure de Cachan, France. He is an IEEE member, and associate editor of the Real-Time Systems Journal and of the Journal of Systems Architecture. His research interests are in real-time systems, real-time operating systems, scheduling algorithms, embedded systems, and wireless sensor networks.

☆: The research leading to these results has received funding from the European Community's Seventh Framework Programme FP7 under grant agreements no. FP7-ICT-216586 and no. 248465 in the context of the ACTORS and S(o)OS projects. Giuseppe Lipari and Tommaso Cucinotta were previously with Scuola Superiore Sant’Anna.

¹: Researcher at Alcatel-Lucent Bell Laboratories, Blanchardstown Business & Technology Park, Dublin – Ireland.
