Efficient algorithms for task mapping on heterogeneous CPU/GPU platforms for fast completion time
Introduction
Graphics processing units (GPUs) are now commonly used as co-processors in many embedded systems to accelerate general-purpose applications. They are particularly capable of executing data-parallel applications due to their highly multi-threaded architecture and high-bandwidth memory. Various embedded system domains can gain higher performance and better energy efficiency by utilizing GPUs. For example, GPUs can efficiently perform matrix operations such as factorization on large data sets, as well as multidimensional FFTs and convolutions. Such operations are common in many embedded applications, including signal processing, imaging, and video processing. By leveraging new programming models such as CUDA [1] and OpenCL [2], programmers can effectively develop highly data-parallel tasks to execute such applications on GPUs.
By providing heterogeneous processing elements with different performance characteristics in the same system, heterogeneous CPU/GPU architectures are expected to offer more flexibility and better performance than homogeneous systems. Fast completion time is an imperative performance metric in most embedded systems. For example, in driver-assisted and autonomous vehicles, video streaming and sensor data processing tasks need to be completed rapidly. In order to minimize the completion time of a set of workloads, the step that maps computations to processing elements is critical. In this paper, we consider the mapping problem in a heterogeneous system containing multiple CPUs and GPUs. Our goal is to minimize the completion time.
This mapping problem is quite challenging due to the large size of the policy space. First of all, applications may demonstrate (sometimes significantly) different performance characteristics when executed on GPUs than on CPUs. The mapping algorithm thus needs to consider such heterogeneity when making prioritization and mapping decisions. Moreover, most real-world workloads are implemented as rather complex task graphs, where a task graph contains a number of data- or logically-dependent tasks. The precedence constraints among tasks require the mapping algorithm to consider: (i) the task graph structure and (ii) the different data transfer costs among tasks executed on different processors. Furthermore, for data-intensive tasks, data partitioning techniques need to be incorporated into the mapping algorithm, because partitioning a task into threads that can run on multiple devices in parallel improves overall utilization.
Without considering the above-mentioned factors, mapping algorithms are unlikely to perform consistently well across different workloads. Prior work on heterogeneous CPU/GPU systems has focused on new programming models and API extensions for supporting multiple heterogeneous devices [3], [4], [5], automating the mapping process [6], [7], [8], and enabling CPU and GPU sharing [9]. Different mapping heuristics have been designed and applied in these works. However, since the fine-grain mapping problem is not their major focus, the existing mapping heuristics make simplified mapping decisions based upon a limited set of metrics (e.g., data locality or execution time).
In this paper, motivated by a number of measurement-based case studies, we design three mapping algorithms, each of which explores a specific set of factors that may affect completion time. We evaluated these algorithms by implementing them on a real heterogeneous system containing a four-core CPU and two discrete GPUs with different performance characteristics. Extensive experiments were conducted using a set of popular benchmarks and workloads, such as Cholesky factorization and Monte Carlo simulation. Experimental results demonstrate that our proposed algorithms can achieve much faster completion time (up to 30% improvement) compared to state-of-the-art mapping techniques. By testing workloads with varying characteristics, the experiments also show that completion time performance under our mapping algorithms is consistent.
We further extend the original scheduling algorithms to account for data transfer time, i.e., reducing the time consumed by data transfers via prediction. To evaluate our extended algorithm, we deployed the original StarPU on a new heterogeneous system containing two ten-core CPUs and four discrete GPUs with comparable setups. To further verify the performance of all proposed scheduling algorithms, we implemented an extended benchmark suite that simulates runtime neural network applications in real-world scenarios. Experimental results demonstrate that our extended algorithm achieves much faster completion time (29% on average with more computing resources and 37% on average with fewer computing resources) compared to state-of-the-art mapping techniques.
The contributions of this paper are listed as follows.
- Idea and approach. We attempt to minimize the completion time of generalized applications in heterogeneous systems by optimizing scheduling strategies, proposing a set of algorithms including a heterogeneity-ratio-based mapping algorithm, a structure-rank-based heuristic algorithm, and a data-partitioning algorithm [10].
- Evaluation. We conduct a set of experiments on a real-world runtime system, i.e., StarPU, to evaluate our proposed algorithms and compare their performance with naive and state-of-the-art scheduling algorithms.
- Extended approach. We propose an extended algorithm, namely heterogeneity-ratio-based and data-partition-optimizing scheduling, to minimize the completion time and enhance the runtime performance of deep-neural-network-based applications on resource-limited infrastructure.
- Extended evaluation. We conduct a set of experiments on StarPU with server-level setups to evaluate the runtime performance of heterogeneity-ratio-based and data-partition-optimizing scheduling and compare it to the naive, state-of-the-art, and originally proposed algorithms.
The rest of this paper is organized as follows. Section 2 presents the background, system model, and our theoretical framework. Section 3 describes the measurement-based case studies that motivate our work. Section 4 presents the practical mapping algorithms and the extended mapping algorithm for neural network acceleration. Section 5 describes our implementation. Section 6 discusses our experimental results, including results for the extended algorithm on both the original and extended benchmarks. Section 7 describes related work. Section 8 concludes this paper.
Background
In this section, we give a list of notations and definitions to help formalize the proposed problem, and then briefly describe the general structure of heterogeneous schedulers.
Case studies: What to consider for making mapping decisions
In this section, we present several measurement-based case studies that motivate the design of our mapping algorithms. We measured the completion time of executing a vector add application A1 (Ai denotes the ith application) and a matrix multiplication application A2 on a heterogeneous system configured with one Intel Core i7 CPU and an NVIDIA GeForce GTX 660 GPU. A1 can be expressed as c = a + x·b, where a and b are vectors and x is a constant. A2 can be expressed as C = A × B, where A, B, …
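The trade-off probed by this kind of CPU-versus-GPU case study can be illustrated with a simple analytical model: the GPU's per-element compute cost is much lower, but it pays a fixed launch overhead plus host-device transfer time, while the CPU pays neither. The sketch below is purely illustrative; all constants (per-element costs, PCIe bandwidth, launch overhead) are assumed numbers, not measurements from the paper.

```python
# Hypothetical completion-time model for a vector-add workload.
# All constants are illustrative assumptions, not measured values.

def completion_time_cpu(n, per_element_cost=4e-9):
    """Estimated CPU completion time: pure compute, no transfer cost."""
    return n * per_element_cost

def completion_time_gpu(n, per_element_cost=1e-10,
                        transfer_bandwidth=8e9, fixed_overhead=1e-4):
    """Estimated GPU completion time: kernel launch overhead + compute
    + PCIe transfers (three vectors of 4-byte floats)."""
    transfer = 3 * n * 4 / transfer_bandwidth
    return fixed_overhead + n * per_element_cost + transfer

# For small inputs the CPU wins (overhead dominates);
# for large inputs the GPU's compute advantage wins.
small, large = 1_000, 100_000_000
assert completion_time_cpu(small) < completion_time_gpu(small)
assert completion_time_gpu(large) < completion_time_cpu(large)
```

Such a model makes clear why a mapping algorithm must weigh input size and transfer cost, not just raw device speed, when placing a task.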
Practical mapping algorithms
In this section, we present three practical online algorithms for mapping tasks in a heterogeneous platform consisting of multiple CPUs and GPUs. Our algorithmic design is motivated by the observations discussed in Section 3. Specifically, the proposed mapping algorithms consider heterogeneity, task graph structure, and data partitioning. The first algorithm (we call it the baseline algorithm) mainly factors heterogeneity into mapping decisions (besides considering traditional factors…
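A minimal sketch of how a heterogeneity-ratio-based heuristic could work, under our own assumptions rather than the paper's exact formulation: tasks with the largest CPU-to-GPU speedup ratio are considered first (they benefit most from a GPU slot), and each task is placed on the processor giving the earliest estimated finish time.

```python
# Sketch of a heterogeneity-ratio-based mapping heuristic (an assumed
# formulation for illustration, not the paper's exact algorithm).

def map_tasks(tasks, processors):
    """tasks: list of dicts with 'name', 'cpu_time', 'gpu_time'.
    processors: dict name -> {'type': 'cpu'|'gpu', 'ready': float}.
    Returns a dict mapping task name -> processor name."""
    # Heterogeneity ratio: how much faster the task runs on a GPU than a CPU.
    ordered = sorted(tasks, key=lambda t: t['cpu_time'] / t['gpu_time'],
                     reverse=True)
    mapping = {}
    for task in ordered:
        best_proc, best_finish = None, float('inf')
        for name, proc in processors.items():
            exec_time = task['gpu_time'] if proc['type'] == 'gpu' else task['cpu_time']
            finish = proc['ready'] + exec_time  # earliest-finish-time estimate
            if finish < best_finish:
                best_proc, best_finish = name, finish
        processors[best_proc]['ready'] = best_finish
        mapping[task['name']] = best_proc
    return mapping
```

For example, with one CPU and one GPU both idle, a task that is 10x faster on the GPU is placed there first, leaving a heterogeneity-insensitive task to the CPU:

```python
procs = {'cpu0': {'type': 'cpu', 'ready': 0.0},
         'gpu0': {'type': 'gpu', 'ready': 0.0}}
tasks = [{'name': 't1', 'cpu_time': 10.0, 'gpu_time': 1.0},
         {'name': 't2', 'cpu_time': 2.0, 'gpu_time': 2.0}]
print(map_tasks(tasks, procs))  # -> {'t1': 'gpu0', 't2': 'cpu0'}
```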
Implementation
We implement our scheduling algorithms on top of the StarPU runtime platform [11] as customized schedulers. The role of the StarPU scheduler is to dispatch tasks onto different processing units (named “workers” internally). In general, the process of a scheduler can be described as follows: given a set of applications, each application consists of a number of tasks waiting to be executed. The scheduler selects tasks from the runnable tasks (i.e., the tasks that have obtained all the data they need) for each…
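The dispatch cycle described above can be sketched as follows. This is a simplified illustration with invented names, not StarPU's actual C API: a task becomes runnable once every task producing its input data has completed, and the scheduler then pushes runnable tasks to workers (here, naively round-robin, with no execution-time modeling).

```python
# Simplified sketch of a runtime scheduler's dispatch cycle
# (illustrative names and policy, not StarPU's actual API).
from collections import deque

def runnable_tasks(tasks, completed):
    """A task is runnable when all tasks producing its inputs have completed."""
    return [t for t in tasks
            if t['name'] not in completed
            and all(d in completed for d in t['deps'])]

def dispatch(tasks, workers):
    """Repeatedly pick runnable tasks and assign them to workers round-robin."""
    completed, schedule = set(), []
    queue = deque(workers)
    while len(completed) < len(tasks):
        ready = runnable_tasks(tasks, completed)
        if not ready:
            break  # dependency cycle; a real runtime would block or report an error
        for t in ready:
            worker = queue[0]
            queue.rotate(-1)  # round-robin worker selection
            schedule.append((t['name'], worker))
            completed.add(t['name'])  # tasks "finish" instantly in this sketch
    return schedule

# A diamond-shaped task graph: a -> {b, c} -> d, dispatched to two workers.
graph = [{'name': 'a', 'deps': []}, {'name': 'b', 'deps': ['a']},
         {'name': 'c', 'deps': ['a']}, {'name': 'd', 'deps': ['b', 'c']}]
print(dispatch(graph, ['cpu0', 'gpu0']))
# -> [('a', 'cpu0'), ('b', 'gpu0'), ('c', 'cpu0'), ('d', 'gpu0')]
```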
Evaluation
In this section, we present the implementation methodology and experimental results used to evaluate the effectiveness of our proposed algorithms.
Related work
Scheduling algorithms for heterogeneous systems. The general problem of scheduling in heterogeneous systems has received much attention. A number of scheduling heuristics have been proposed for scheduling directed acyclic graph (DAG)-based applications in heterogeneous systems [24], [25], [26], [27], [28], [29]. These algorithms schedule a single DAG of tasks onto heterogeneous processing units with varying speeds to minimize the completion time. Zhao et al. [30] proposed…
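The HEFT-family heuristics cited above prioritize tasks by their "upward rank": rank_u(t) = w(t) + max over successors s of (c(t, s) + rank_u(s)), where w(t) is the task's mean execution cost across processors and c(t, s) the mean communication cost of the edge. A minimal implementation over a DAG given as adjacency lists:

```python
# Upward-rank computation as used by HEFT-style list scheduling heuristics.

def upward_rank(succ, w, c):
    """succ: task -> list of successor tasks (the DAG's adjacency lists).
    w: task -> mean execution cost; c: (task, successor) -> mean comm cost.
    Returns a dict task -> upward rank."""
    memo = {}

    def rank(t):
        if t not in memo:
            # Exit tasks (no successors) have rank equal to their own cost.
            memo[t] = w[t] + max((c[(t, s)] + rank(s) for s in succ[t]),
                                 default=0)
        return memo[t]

    for t in succ:
        rank(t)
    return memo

# Diamond DAG: a -> {b, c} -> d.
succ = {'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []}
w = {'a': 2, 'b': 3, 'c': 1, 'd': 2}
c = {('a', 'b'): 1, ('a', 'c'): 4, ('b', 'd'): 2, ('c', 'd'): 1}
print(upward_rank(succ, w, c))  # -> {'d': 2, 'b': 7, 'c': 4, 'a': 10}
```

Tasks are then scheduled in decreasing rank order, which guarantees every task is considered before its successors; the cited variants differ mainly in how the rank and the per-processor placement step are computed.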
Conclusion
In this paper, we investigate the problem of mapping multiple applications implemented as task graphs in a heterogeneous system consisting of CPUs and GPUs. To achieve fast completion time, we present a fine-grain mapping framework that explores a set of critical factors suggested by several measurement-based case studies. We present a theoretical framework that formulates this problem as an integer program, and a set of practically efficient mapping algorithms. We implement the…
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is partially supported by the National Natural Science Foundation of China (Grant No. 61902169), Shenzhen Peacock Plan, China (Grant No. KQTD2016112514355531), and Science and Technology Innovation Committee Foundation of Shenzhen, China (Grant No. JCYJ20170817110848086). This work is also partially supported by the Climbing Project of China under Grant No. pdjh2019c438. We would like to thank Yiwei Cheng for his help on collecting the trace data and Mingyuan Wu for his help on
References (82)
- et al., Dynamic task mapping and scheduling with temperature-awareness on network-on-chip based multicore systems, J. Syst. Archit. (2019)
- et al., Building real-time parallel task systems on multi-cores: a hierarchical scheduling approach, J. Syst. Archit. (2019)
- et al., Pessimism in multicore global schedulability analysis, J. Syst. Archit. (2019)
- et al., Task mapping and scheduling for network-on-chip based multi-core platform with transient faults, J. Syst. Archit. (2018)
- et al., Thermal-aware correlated two-level scheduling of real-time tasks with reduced processor energy on heterogeneous MPSoCs, J. Syst. Archit. (2018)
- et al., GPUart: an application-based limited preemptive GPU real-time scheduler for embedded systems, J. Syst. Archit. (2019)
- et al., An empirical study of boosting spectrum-based fault localization via PageRank, IEEE Trans. Softw. Eng. (2019)
- et al., Reinforcement-learning-guided source code summarization via hierarchical attention, IEEE Trans. Softw. Eng. (2020)
- et al., UNPU: a 50.6 TOPS/W unified deep neural network accelerator with 1b-to-16b fully-variable weight bit-precision
- Compute Unified Device Architecture Programming Guide (2014)
- The OpenCL Language
- StarPU-MPI: task programming over clusters of machines enhanced with accelerators
- Application-specific topology-aware mapping for three dimensional topologies
- SkePU: a multi-backend skeleton programming library for multi-GPU systems
- A static task partitioning approach for heterogeneous systems using OpenCL
- TimeGraph: GPU scheduling for real-time multi-tasking environments
- Task mapping in heterogeneous embedded systems for fast completion time
- StarPU: a unified platform for task scheduling on heterogeneous multicore architectures, Concurr. Comput.: Pract. Exper.
- Supporting preemptive task executions and memory copies in GPGPUs
- Transparent CPU-GPU collaboration for data-parallel kernels on heterogeneous systems
- How to optimize performance with StarPU
- Generational thread scheduler using reservations for fair scheduling
- StarPU-dmdar
- Fast R-CNN
- Faster R-CNN: towards real-time object detection with region proposal networks
- Handwriting Arabic character recognition LeNet using neural network, Int. Arab J. Inf. Technol.
- ImageNet: a large-scale hierarchical image database
- Performance-effective and low-complexity task scheduling for heterogeneous computing, IEEE Trans. Parallel Distrib. Syst.
- DAG scheduling using a lookahead variant of the heterogeneous earliest finish time algorithm
- An experimental investigation into the rank function of the heterogeneous earliest finish time scheduling algorithm
- List scheduling algorithm for heterogeneous systems by an optimistic cost table, IEEE Trans. Parallel Distrib. Syst.
- A hybrid heuristic for DAG scheduling on heterogeneous systems
- Comparative evaluation of the robustness of DAG scheduling heuristics
- Scheduling multiple DAGs onto heterogeneous systems
- Exploring the multitude of real-time multi-GPU configurations
- EdSketch: execution-driven sketching for Java, STTT
Zexin Li received the B.S. degree from the Southern University of Science and Technology, Shenzhen, China, in 2020. He is currently a Ph.D. student in the University of Texas at Dallas. His research interests focus on real-time and embedded systems.
Yuqun Zhang received the B.S. degree from Tianjin University, Tianjin, China, the M.S. degree from the University of Rochester, Rochester, NY, USA, and the Ph.D. degree from the University of Texas at Austin, Austin, TX, USA. He is now an Assistant Professor with the Southern University of Science and Technology, Shenzhen, China. His research interests include software engineering and services computing.
Ao Ding received the B.S. degree from the Southern University of Science and Technology, Shenzhen, China, in 2020. He is currently a M.S. student in the Southern University of Science and Technology, Shenzhen, China. His research interests include software engineering and computer vision.
Husheng Zhou received the Ph.D. degree in computer science from the University of Texas at Dallas, in 2018. He is currently working in VMware, Austin, Texas, as a member of technical staff. His research interests include real-time systems, autonomous embedded systems and computer security.
Cong Liu received the Ph.D. degree in computer science from the University of North Carolina at Chapel Hill, in Jul. 2013. He is an associate professor in the Department of Computer Science, the University of Texas at Dallas. His research interests include real-time systems and GPGPU. He has published more than 30 papers in premier conferences and journals. He received the Best Paper Award at the 30th IEEE RTSS and the 17th RTCSA. He is a member of the IEEE.