Real-Time Scheduling and Analysis of Parallel Tasks on Heterogeneous Multi-cores
Introduction
To meet increasing performance requirements, parallel hardware architectures have become mainstream in the multi-core embedded domain. Parallel programming models are fundamental to exploiting the performance of these architectures [1], [2], [3]. In recent years, research on parallel task scheduling under real-time constraints has made great progress [4]. Researchers have developed several parallel programming paradigms, such as MPI [5] and OpenMP [6], [7], and parallel programming languages such as CilkPlus [8], to help developers create parallel programs. All of these paradigms support intra-task parallelism, in which a single task consists of multiple code parts that can execute simultaneously. The DAG task model is a promising way to describe software with intra-task parallelism. The real-time scheduling and analysis of the DAG parallel task model has received considerable attention in the real-time and high-performance computing communities [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18].
Moreover, heterogeneous hardware architectures, which exploit specialized processing capabilities and can offer higher performance and energy efficiency than homogeneous architectures, have received increasing attention [19], [20], [21], [22], [23]. In general, a heterogeneous hardware architecture consists of components that are asymmetric in performance and functionality [24], [25]. It integrates low-power general-purpose multi-cores (known as the host) with specialized coprocessors (e.g., Cell/BE SPUs) or data-parallel accelerators (e.g., GPUs), as in the NVIDIA Tegra X1 [26] or Xilinx UltraScale [27]. Heterogeneous multiprocessor systems on a chip (MPSoCs), one class of heterogeneous hardware architecture, are widely used in real-time embedded systems. As introduced in [28], [29], [30], MPSoCs can be broadly classified by performance heterogeneity and functional heterogeneity. Under performance heterogeneity, cores with the same functionality (i.e., the same instruction set architecture (ISA)) but different power-performance characteristics are integrated. Under functional heterogeneity, cores with very different functionality (i.e., different ISAs) are interspersed on the same die. The Jetson TX2 [31] exhibits performance heterogeneity, since it adopts the big.LITTLE architecture [32], which integrates high-performance cores (big cores) with low-power cores (LITTLE cores). It contains two Denver cores (high-performance cores) and four ARM Cortex-A57 cores (low-power cores). The Denver cores and ARM cores have different power-performance characteristics, but they are coherent and share the same ISA.
Current parallel programming frameworks tend to support heterogeneous multi-cores. For example, in OpenMP [6], the proc_bind clause can be used to specify how threads are mapped to processing cores. In CUDA [33], the cudaSetDevice function selects the target device for subsequent execution. In OpenCL [34], the clCreateCommandQueue function creates a command queue for a given device.
In this paper, we consider the real-time scheduling of typed DAG tasks on heterogeneous multi-cores, where each vertex is explicitly bound to a specific type of core for execution. Binding code snippets of a program to a specific type of core is a common operation in heterogeneous multi-core scheduling and can be easily implemented in mainstream parallel programming frameworks and operating systems.
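To make the notion of a typed DAG task concrete, the following is a minimal sketch of one possible in-memory representation. It is an illustration only, not the paper's formal model: the class name `TypedDAG`, the field names `wcet`, `ctype`, and `edges`, and the type labels `"cpu"` and `"gpu"` are all hypothetical. The sketch computes two quantities commonly used in DAG response time analysis, the total workload (optionally restricted to one core type) and the length of the longest path.

```python
from dataclasses import dataclass


@dataclass
class TypedDAG:
    """Hypothetical container for a typed DAG task (illustrative names)."""
    wcet: dict   # vertex -> worst-case execution time
    ctype: dict  # vertex -> core type the vertex is bound to
    edges: list  # list of (u, v) precedence constraints, u before v

    def volume(self, core_type=None):
        """Total WCET of all vertices, or only those bound to core_type."""
        return sum(c for v, c in self.wcet.items()
                   if core_type is None or self.ctype[v] == core_type)

    def longest_path(self):
        """Length of the longest path, summing vertex WCETs along it."""
        succs = {v: [] for v in self.wcet}
        indeg = {v: 0 for v in self.wcet}
        for u, v in self.edges:
            succs[u].append(v)
            indeg[v] += 1
        # Kahn's algorithm: the list grows while it is being traversed.
        order = [v for v in self.wcet if indeg[v] == 0]
        dist = dict(self.wcet)  # best path length ending at each vertex
        for u in order:
            for v in succs[u]:
                dist[v] = max(dist[v], dist[u] + self.wcet[v])
                indeg[v] -= 1
                if indeg[v] == 0:
                    order.append(v)
        return max(dist.values())


# A small fork-join example: a -> {b, c} -> d, with b and c bound to "gpu".
dag = TypedDAG(
    wcet={"a": 2, "b": 3, "c": 3, "d": 1},
    ctype={"a": "cpu", "b": "gpu", "c": "gpu", "d": "cpu"},
    edges=[("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")],
)
print(dag.volume(), dag.volume("gpu"), dag.longest_path())  # 9 6 6
```

In this example the two `"gpu"` vertices are potentially parallel, so whether they delay each other depends on how many cores of that type the platform provides.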
The real-time scheduling of typed DAG tasks on heterogeneous platforms is studied in [35], [36], [37]. All of these works schedule DAG tasks with a work-conserving algorithm; their response time analysis methods are summarized as follows. Jeffrey et al. [35] proposed the first WCRT bound for the general typed DAG task model. However, Jeffrey’s response time bound is very pessimistic. Serrano et al. [36] proposed a response time bound for a specific typed DAG task model with two types of cores, which has certain limitations. Han et al. [37] developed two response time bounds: the first dominates Jeffrey’s bound [35] in analysis precision, and the second significantly improves the precision by exploiting more detailed information about the task graph structure. Even so, Han’s response time bounds are still very pessimistic. When analyzing the worst-case response time of each path, they account for blocking from vertices of the same type on other paths that have already completed or have not yet started, which is unnecessary. Yang et al. [38] studied the scheduling of typed DAG tasks by decomposing each DAG task into a set of independent subtasks with artificial release times and deadlines.
This paper aims to obtain a more accurate WCRT upper bound for typed DAG tasks. To solve the problems in the earlier work, we propose a criticality allocation strategy, which assigns a criticality to each vertex. The criticality determines the urgency of vertex execution and decreases as the remaining workload of the vertex decreases. Based on this strategy, we propose a new WCRT bound to verify the schedulability of DAG tasks supporting heterogeneous computing. It reduces the number of potentially parallel vertices to a relatively small range. Experiments with randomly generated workloads show that our proposed criticality allocation strategy and new bound are significantly more precise than the existing bounds.
The rest of the paper is organized as follows. Section 2 discusses the related work. Section 3 introduces the system model. Section 4 enumerates the known WCRT upper bounds for the same model as ours and analyzes the problems in these bounds. Section 5 presents a new scheduling strategy for typed DAG tasks based on criticality allocation. Section 6 describes the response time analysis for our new scheduling strategy. Section 7 presents the experimental results comparing our approach with existing WCRT bounds. Finally, Section 8 concludes the paper and highlights some future research directions.
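As an illustration of the work-conserving rule used by the prior work (a ready vertex must run whenever a core of its type is idle), the following is a minimal simulation sketch. It is not the algorithm of any cited paper: the function name `work_conserving_makespan`, its parameters, the non-preemptive execution, and the first-come ordering of the ready list are all assumptions made for the example. It also assumes every vertex runs for exactly its WCET, that there are no duplicate edges, and that every type used by a vertex has at least one core.

```python
import heapq


def work_conserving_makespan(wcet, ctype, edges, cores):
    """Makespan of one work-conserving, non-preemptive schedule.

    wcet:  vertex -> execution time (assumed equal to the WCET)
    ctype: vertex -> core type
    edges: list of (u, v) precedence constraints
    cores: core type -> number of cores of that type (assumed >= 1)
    """
    preds = {v: set() for v in wcet}
    succs = {v: [] for v in wcet}
    for u, v in edges:
        preds[v].add(u)
        succs[u].append(v)

    free = dict(cores)                      # idle cores per type
    ready = [v for v in wcet if not preds[v]]
    running = []                            # min-heap of (finish_time, vertex)
    done = set()
    now = 0

    while ready or running:
        # Work-conserving step: start every ready vertex whose type has
        # an idle core; the rest keep waiting in their original order.
        waiting = []
        for v in ready:
            if free[ctype[v]] > 0:
                free[ctype[v]] -= 1
                heapq.heappush(running, (now + wcet[v], v))
            else:
                waiting.append(v)
        ready = waiting

        # Advance time to the next completion and release successors
        # whose predecessors have all finished.
        now, v = heapq.heappop(running)
        free[ctype[v]] += 1
        done.add(v)
        for w in succs[v]:
            if preds[w] <= done:
                ready.append(w)
    return now


# Fork-join example: a -> {b, c} -> d, with b and c bound to "gpu".
wcet = {"a": 2, "b": 3, "c": 3, "d": 1}
ctype = {"a": "cpu", "b": "gpu", "c": "gpu", "d": "cpu"}
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]

one_gpu = work_conserving_makespan(wcet, ctype, edges, {"cpu": 1, "gpu": 1})
two_gpu = work_conserving_makespan(wcet, ctype, edges, {"cpu": 1, "gpu": 2})
print(one_gpu, two_gpu)  # 9 6
```

With a single core of type `"gpu"`, vertex `c` is blocked by `b` for three time units, which is the kind of same-type blocking the response time bounds must account for. A simulation like this yields only one possible schedule, so it cannot replace a WCRT bound.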
Related work
Parallel and heterogeneous hardware architectures have become mainstream in the embedded real-time domain to cope with increasing performance requirements. Scheduling an application modeled as a DAG, which is fundamental to parallel programming models, is a key problem when aiming at high performance. The classical response time bound for untyped DAG tasks was proposed by Graham [39], [40]. Based on [39], [40], the response time analysis for multiple untyped DAG (in which the vertices
Platform model
The heterogeneous multi-core platform consisting of S types of cores is formulated as a collection of cores where Cs (s ∈ [1, S]) is the set of the s-th type of cores. For the sake of convenience, we let ms be the cardinality of Cs, i.e., 1
Task model
The parallel task is formulated as a typed DAG model where V is the set of vertices, E
Existing WCRT bound
We briefly review some prior results from the response time analysis literature, some of which we will compare with the response time bound derived in this paper. In the known work on the considered model, the typed DAG task G is scheduled on the heterogeneous multi-core platform by a work-conserving scheduling algorithm, under which an eligible vertex of type s, i.e., one whose predecessors have all finished, must be executed if a core of type s is available. It also applies to
A new scheduling algorithm for the typed DAG task
In the previous section, we analyzed the existing bounds and their problems. These bounds show that the response time calculation of each path l is divided into two parts: the length of the path l, and the time for which the vertices on the path l are blocked. The key to our work is improving the accuracy of the blocked-time calculation. These bounds overestimate the number of vertices in the DAG task which can block the vertices with the same type on the path l that
Response time analysis for typed DAG task
We present a new response time analysis method to support heterogeneous and parallel computation based on the RTA presented in (8). The new bound, CPB, excludes as much as possible of the parallel workload on the same type of cores that does not cause any blocked time. This reduces the self-interference factor, making the new response time upper bound more accurate than (5).
In Section 5, all the unit-nodes have been assigned to criticality sets and each
Evaluation
In this section, we experimentally compare our proposed response time analysis method CPB, which is based on the criticality allocation strategy, with the known WCRT bounds in terms of both precision and efficiency. Since the platform model studied in [36] differs from that of the other known WCRT bounds, we divide the experiments into two parts.
Conclusion
This paper presents a new scheduling algorithm and a new WCRT bound for typed DAG parallel tasks supporting heterogeneous computing, where the workload of each vertex in the typed DAG is only allowed to execute on a particular type of core. The known WCRT bounds for work-conserving scheduling algorithms are pessimistic because they include unnecessary blocked time from vertices of the same type on parallel paths that have already completed or have not yet started.
Declaration of Competing Interest
The authors of the manuscript "Real-Time Scheduling and Analysis of Parallel Tasks on Heterogeneous Multi-cores", Shuangshuang Chang, Xufeng Zhao, and Qingxu Deng, declare that they have no financial or personal relationships with other people or organizations that could inappropriately influence this work, and that there is no professional or other personal interest of any nature or kind in any product, service, or company that could be construed as influencing the position presented in, or the review of, the manuscript.
Acknowledgement
This work was supported by the National Key R&D Program of China under Grant No. 2018YFB1702000, the Joint Funds of the National Natural Science Foundation of China under Grant No. U1908212, the National Natural Science Foundation of China under Grant No. 61972076, the National Natural Science Foundation of China under Grant No. 61871107 and the National Natural Science Foundation of China under Grant No. 61602104.
References (55)
- Building real-time parallel task systems on multi-cores: a hierarchical scheduling approach. J. Syst. Archit. (2019)
- Bounding carry-in interference for synchronous parallel tasks under global fixed-priority scheduling. J. Syst. Archit. (2018)
- Evaluation framework for energy-aware multiprocessor scheduling in real-time systems. J. Syst. Archit. (2019)
- Scope-aware data cache analysis for OpenMP programs on multi-core processors. J. Syst. Archit. (2019)
- Analysis of federated and global scheduling for parallel real-time tasks. 2014 26th Euromicro Conference on Real-Time Systems (2014)
- Thread-level priority assignment in global multiprocessor scheduling for DAG tasks. J. Syst. Softw. (2016)
- Data dependency reduction for high-performance FPGA implementation of Deflate compression algorithm. J. Syst. Archit. (2019)
- Exploring heterogeneous scheduling for edge computing with CPU and FPGA MPSoCs. J. Syst. Archit. (2019)
- Ker-one: a new hypervisor managing FPGA reconfigurable accelerators. J. Syst. Archit. (2019)
- A survey and taxonomy of FPGA-based deep learning accelerators. J. Syst. Archit. (2019)
- Efficient large-scale heterogeneous debugging using dynamic tracing. J. Syst. Archit.
- Minimizing temperature and energy of real-time applications with precedence constraints on heterogeneous MPSoC systems. J. Syst. Archit.
- Predicting performance in multi-core systems with shared reconfigurable accelerators. J. Syst. Archit.
- Parallel batch scheduling with inclusive processing set restrictions and non-identical capacities to minimize makespan. Eur. J. Oper. Res.
- On the optimality of the TLS algorithm for solving the online-list scheduling problem with two job types on a set of multipurpose machines. J. Comb. Optim.
- Scheduling with processing set restrictions: PTAS results for several variants. Int. J. Prod. Econ.
- Scheduling with processing set restrictions: a survey. Int. J. Prod. Econ.
- Scheduling with processing set restrictions: a literature update. Int. J. Prod. Econ.
- Measuring the performance of schedulability tests. Real-Time Syst.
- A survey of parallel hard-real time scheduling on task models and scheduling approaches. ARCS 2017; 30th International Conference on Architecture of Computing Systems
- Overview of the MPI-IO parallel I/O interface. IPPS95 Workshop on Input/Output in Parallel and Distributed Systems
- Multi-core real-time scheduling for generalized parallel task models. Real-Time Syst.
- A multi-DAG model for real-time parallel applications with conditional execution. Proceedings of the 30th Annual ACM Symposium on Applied Computing
- Semi-federated scheduling of parallel real-time tasks on multiprocessors. 2017 IEEE Real-Time Systems Symposium (RTSS)
- Scheduling parallel real-time recurrent tasks on multicore platforms. IEEE Trans. Parallel Distrib. Syst.
- The federated scheduling of systems of conditional sporadic DAG tasks. Proceedings of the 12th International Conference on Embedded Software
Shuangshuang Chang received the M.S. degree in computer technology from Northeastern University, Shenyang, China, in 2016, where she is currently pursuing the Ph.D. degree. Her current research interests include embedded real-time system, scheduling analysis in mixed criticality system, and security mechanism of cyber-physical systems.
Xufeng Zhao was born in Panjin, Liaoning, China, in 1996. He received his bachelor’s degree in Internet of Things engineering from Northeastern University, China, in 2018, and is currently a master’s candidate in computer system architecture at Northeastern University, China. His research interests are broadly in embedded real-time systems, especially real-time scheduling on multi-core systems.
Zhenyu Liu was born in Nanyang, Henan, China, in 1995. He received his bachelor’s degree in Internet of Things engineering from Northeastern University, China, in 2018, and is currently a master’s candidate in computer applications technology at Northeastern University, China. His research interests include real-time embedded systems and cyber-physical systems.
Qingxu Deng received his Ph.D. degree in computer science from Northeastern University, China, in 1997. He is a professor in the School of Computer Science and Engineering, Northeastern University, China, where he serves as the Director of the Institute of Cyber-Physical Systems. His main research interests include cyber-physical systems, embedded systems, and real-time systems.