Efficient algorithms for task mapping on heterogeneous CPU/GPU platforms for fast completion time

https://doi.org/10.1016/j.sysarc.2020.101936

Abstract

In GPU-based embedded systems, the problem of mapping computation and data for multiple applications while minimizing the completion time is quite challenging due to the large size of the policy space. To achieve fast completion time, heterogeneous embedded systems need a fine-grained mapping framework that explores a set of critical factors. In this paper, we present a theoretical framework that yields a sub-optimal solution via three practical mapping algorithms with low time complexity. We evaluate these algorithms on StarPU with a large set of popular benchmarks. Experimental results demonstrate that the algorithms proposed in the original EMSOFT paper can achieve up to 30% faster completion time compared to state-of-the-art mapping techniques, and perform consistently well across different workloads. We further extend these algorithms to minimize the completion time and enhance the runtime performance of complex heterogeneous applications on resource-limited infrastructure. We also extend the evaluation by deploying StarPU under multiple setups with an additional benchmark suite that simulates real-world runtime neural networks. Experimental results demonstrate that our extended algorithm can achieve much faster completion time (30% to 37% on average across multiple resource-constrained scenarios) compared to state-of-the-art mapping techniques.

Introduction

Graphics processing units (GPUs) are now commonly used as co-processors in many embedded systems to accelerate general-purpose applications. They are particularly well suited to executing data-parallel applications, due to their highly multi-threaded architecture and high-bandwidth memory. Various embedded system domains can gain high performance and better energy efficiency by utilizing GPUs. For example, GPUs can efficiently perform matrix operations such as factorization on large data sets, as well as multidimensional FFTs and convolutions. Such operations appear in many embedded applications, including signal processing and image and video processing. By leveraging new programming models, such as CUDA [1] and OpenCL [2], programmers can effectively develop highly data-parallel tasks to execute such applications on GPUs.

By providing heterogeneous processing elements with different performance characteristics in the same system, heterogeneous CPU/GPU architectures are expected to offer more flexibility and better performance than homogeneous systems. Fast completion time is a critical performance metric that needs to be optimized in most embedded systems. For example, in driver-assistance and autonomous vehicles, the video streaming and sensor data processing tasks need to be completed rapidly. To minimize the completion time of a set of workloads, the step that maps computations to processing elements is critical. In this paper, we consider the mapping problem in a heterogeneous system containing multiple CPUs and GPUs. Our goal is to minimize the completion time.

This mapping problem is quite challenging due to the large size of the policy space. First of all, applications may demonstrate (sometimes significantly) different performance characteristics when executed on GPUs rather than on CPUs. The mapping algorithm thus needs to consider such heterogeneity when making prioritization and mapping decisions. Moreover, most real-world workloads are implemented as rather complex task graphs, where a task graph contains a number of data- or logic-dependent tasks. The precedence constraints among tasks require the mapping algorithm to consider (i) the task graph structure and (ii) the different data transfer costs incurred when dependent tasks execute on different processors. Furthermore, for data-intensive tasks, data partitioning techniques need to be incorporated into the mapping algorithm, because partitioning a task into threads that can run on multiple devices in parallel improves the overall utilization.
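To make the data-partitioning factor concrete, the toy model below (our illustration, not the paper's formulation; the rates, the transfer-cost term, and all identifiers are hypothetical) computes the makespan of one task whose work is split between a CPU and a GPU and sweeps the split ratio. The best ratio is the one that keeps both devices finishing at roughly the same time, which is why partitioning a data-intensive task across devices improves utilization.

    #include <stdio.h>

    /* Makespan of a task of W work units split between a CPU and a GPU:
     * the CPU gets a fraction alpha, the GPU gets the rest plus a
     * data-transfer overhead proportional to its share. Both run in
     * parallel, so the completion time is the slower of the two. */
    static double completion_time(double W, double alpha, double cpu_rate,
                                  double gpu_rate, double xfer_per_unit)
    {
        double cpu_time = (alpha * W) / cpu_rate;
        double gpu_time = ((1.0 - alpha) * W) / gpu_rate
                        + (1.0 - alpha) * W * xfer_per_unit;
        return cpu_time > gpu_time ? cpu_time : gpu_time;
    }

    int main(void)
    {
        double W = 1e6, cpu_rate = 1.0, gpu_rate = 8.0, xfer = 0.02;
        double best_alpha = 0.0;
        double best_t = completion_time(W, 0.0, cpu_rate, gpu_rate, xfer);

        /* Sweep the split ratio to find the balance point. */
        for (double a = 0.01; a <= 1.0; a += 0.01) {
            double t = completion_time(W, a, cpu_rate, gpu_rate, xfer);
            if (t < best_t) { best_t = t; best_alpha = a; }
        }
        printf("best CPU share %.2f, makespan %.0f\n", best_alpha, best_t);
        return 0;
    }

With the hypothetical rates above, giving everything to the GPU still leaves the CPU idle; the sweep instead settles on a small CPU share that lets both devices finish together, cutting the makespan below either single-device option.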

Without considering the above-mentioned factors, mapping algorithms are unlikely to perform consistently well across different workloads. Prior work on heterogeneous CPU/GPU systems has focused on new programming models and API extensions for supporting multiple heterogeneous devices [3], [4], [5], automating the mapping process [6], [7], [8], and enabling CPU and GPU sharing [9]. Different mapping heuristics have been designed and applied in these works. However, since the fine-grained mapping problem is not the major focus of these works, the existing heuristics make simplified mapping decisions based on a limited set of metrics (e.g., data locality or execution time).

In this paper, motivated by a number of measurement-based case studies, we design three mapping algorithms, each of which explores a specific set of factors that may affect the completion time. We evaluated these algorithms by implementing them on a real heterogeneous system containing a four-core CPU and two discrete GPUs with different performance characteristics. Extensive experiments were conducted using a set of popular benchmarks and workloads, such as Cholesky factorization and Monte Carlo simulation. Experimental results demonstrate that our proposed algorithms can achieve much faster completion time (up to 30% improvement) compared to state-of-the-art mapping techniques. Experiments with workloads of varying characteristics further show that the completion time performance of our mapping algorithms is consistent.

We further extend the original scheduling algorithms by taking the data transfer time into account, i.e., reducing the time spent on data transfer prediction. To evaluate the extended algorithm, we deployed the original StarPU on a new heterogeneous system containing two ten-core CPUs and four discrete GPUs with similar configurations. To further verify the performance of all proposed scheduling algorithms, we implemented an extended benchmark suite that simulates runtime neural network applications in real-world scenarios. Experimental results demonstrate that our extended algorithm can achieve much faster completion time (29% on average with more computing resources and 37% on average with fewer computing resources) compared to state-of-the-art mapping techniques.

The contributions of this paper are listed as follows.

  • Idea and approach. We attempt to minimize the completion time of generalized applications in heterogeneous systems by optimizing scheduling strategies with a set of algorithms: a heterogeneity-ratio-based mapping algorithm, a structure-rank-based heuristic algorithm, and a data-partitioning algorithm [10].

  • Evaluation. We conduct a set of experiments on a real-world runtime system, StarPU, to evaluate our proposed algorithms and compare their performance with naive and state-of-the-art scheduling algorithms.

  • Extended Approach. We propose an extended algorithm, namely heterogeneity-ratio-based and data-partition-optimizing scheduling, to minimize the completion time and enhance the runtime performance of deep-neural-network-based applications on resource-limited infrastructure.

  • Extended Evaluation. We conduct a set of experiments on StarPU with server-level setups to evaluate the runtime performance of heterogeneity-ratio-based and data-partition-optimizing scheduling and compare it to naive, state-of-the-art, and our originally proposed algorithms.

The rest of this paper is organized as follows. Section 2 presents the background, system model and our theoretical framework. Section 3 describes the measurement-based case studies that motivate our design. Section 4 presents the practical mapping algorithms and the extended mapping algorithm for neural network acceleration. Section 5 describes our implementation. Section 6 discusses our experimental results, including the results for the extended algorithm on both the original and the extended benchmarks. Section 7 describes related work. Section 8 concludes this paper.

Section snippets

Background

In this section, we give a list of notations and definitions to help formalize the proposed problem, and then briefly describe the general structure of heterogeneous schedulers.

Case studies: What to consider for making mapping decisions

In this section, we present several measurement-based case studies that motivate the design of our mapping algorithms. We measured the completion time of executing a vector add application τ1 (τi denotes the ith application) and a matrix multiplication application τ2 on a heterogeneous system configured with one Intel Core i7 CPU and one NVIDIA GeForce GTX 660 GPU. τ1 can be expressed as (v1 + v2) · π, where v1 and v2 are vectors and π is a constant. τ2 can be expressed as (a · b) + (c · d), where a, b, c
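For concreteness, the sketch below shows how a vector-add task like τ1 could be written as a StarPU codelet and submitted through StarPU's task-insertion helper. This is our illustration, not the paper's code: the function and variable names are hypothetical, a CUDA implementation would additionally be listed in the codelet's cuda_funcs field so the runtime can place the task on either device type, and the helper and macros assume a reasonably recent StarPU release.

    #include <starpu.h>
    #include <stdint.h>

    /* CPU implementation of tau_1: v1 <- (v1 + v2) * pi. */
    static void vec_scale_add_cpu(void *buffers[], void *cl_arg)
    {
        float *v1 = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
        float *v2 = (float *)STARPU_VECTOR_GET_PTR(buffers[1]);
        unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
        float pi;
        starpu_codelet_unpack_args(cl_arg, &pi);
        for (unsigned i = 0; i < n; i++)
            v1[i] = (v1[i] + v2[i]) * pi;
    }

    static struct starpu_codelet vec_cl = {
        .cpu_funcs = { vec_scale_add_cpu },
        /* .cuda_funcs = { ... },  a GPU variant would be listed here */
        .nbuffers  = 2,
        .modes     = { STARPU_RW, STARPU_R },
    };

    int main(void)
    {
        float v1[1024], v2[1024], pi = 3.14159f;
        starpu_data_handle_t h1, h2;

        if (starpu_init(NULL) != 0) return 1;
        for (unsigned i = 0; i < 1024; i++) { v1[i] = 1.0f; v2[i] = 2.0f; }

        /* Register the vectors so StarPU can manage their transfers. */
        starpu_vector_data_register(&h1, STARPU_MAIN_RAM, (uintptr_t)v1, 1024, sizeof(float));
        starpu_vector_data_register(&h2, STARPU_MAIN_RAM, (uintptr_t)v2, 1024, sizeof(float));

        starpu_task_insert(&vec_cl, STARPU_RW, h1, STARPU_R, h2,
                           STARPU_VALUE, &pi, sizeof(pi), 0);
        starpu_task_wait_for_all();

        starpu_data_unregister(h1);
        starpu_data_unregister(h2);
        starpu_shutdown();
        return 0;
    }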

Practical mapping algorithms

In this section, we present three practical online algorithms for mapping tasks onto a heterogeneous platform consisting of multiple CPUs and GPUs. Our algorithmic design is motivated by the observations discussed in Section 3. Specifically, the proposed mapping algorithms consider heterogeneity, task graph structure, and data partitioning. The first algorithm (we call it the baseline algorithm) mainly factors heterogeneity into making mapping decisions (besides considering traditional factors
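To illustrate the kind of decision the baseline algorithm makes, the sketch below (our simplification, not the paper's pseudocode; the structs, the transfer-cost term, and the tie-breaking are assumptions) scores a ready task by its CPU-to-GPU execution-time ratio and maps it to the processing unit with the earliest predicted finish time.

    #include <stddef.h>

    struct worker {            /* one CPU core or one GPU */
        int    is_gpu;
        double ready_at;       /* time at which the worker becomes free */
    };

    struct task {
        double cpu_time;       /* predicted execution time on a CPU core */
        double gpu_time;       /* predicted execution time on a GPU */
        double transfer_time;  /* extra data-transfer cost if run on a GPU */
    };

    /* Heterogeneity ratio: how much the task gains from running on a GPU. */
    static double het_ratio(const struct task *t)
    {
        return t->cpu_time / (t->gpu_time + t->transfer_time);
    }

    /* Map a task to the worker that minimizes its predicted finish time. */
    static size_t map_task(const struct task *t,
                           const struct worker *w, size_t nworkers)
    {
        size_t best = 0;
        double best_finish = -1.0;
        for (size_t i = 0; i < nworkers; i++) {
            double exec   = w[i].is_gpu ? t->gpu_time + t->transfer_time
                                        : t->cpu_time;
            double finish = w[i].ready_at + exec;
            if (best_finish < 0.0 || finish < best_finish) {
                best_finish = finish;
                best = i;
            }
        }
        return best;
    }

In a ready queue, tasks with a larger het_ratio would be considered for a GPU first, so the scarce accelerators are reserved for the tasks that benefit most from them.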

Implementation

We implement our scheduling algorithms on top of the StarPU runtime platform [11] as customized schedulers. The role of the StarPU scheduler is to dispatch tasks onto different processing units (named “workers” internally). In general, a scheduler proceeds as follows: given n applications, each consisting of a number of tasks waiting to be executed, the scheduler selects tasks from the runnable ones (i.e., the tasks whose required data are all available) for each
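The StarPU hook for such a customized scheduler is the starpu_sched_policy structure, whose push_task callback runs when a task becomes runnable (the natural place for the mapping decision) and whose pop_task callback runs when an idle worker requests work. The skeleton below is a sketch against the StarPU 1.3-era API and is not the paper's actual scheduler: the callback bodies are placeholders, and field names or signatures may differ between releases.

    #include <starpu.h>
    #include <starpu_scheduler.h>

    static void my_init_sched(unsigned sched_ctx_id)
    {
        /* Allocate per-context queues, performance-model state, etc. */
        (void)sched_ctx_id;
    }

    static void my_deinit_sched(unsigned sched_ctx_id)
    {
        (void)sched_ctx_id;
    }

    static int my_push_task(struct starpu_task *task)
    {
        /* A task just became runnable: decide which worker queue gets it. */
        (void)task;
        return 0;
    }

    static struct starpu_task *my_pop_task(unsigned sched_ctx_id)
    {
        /* An idle worker asks for its next task. */
        (void)sched_ctx_id;
        return NULL;
    }

    static struct starpu_sched_policy my_policy = {
        .init_sched         = my_init_sched,
        .deinit_sched       = my_deinit_sched,
        .push_task          = my_push_task,
        .pop_task           = my_pop_task,
        .policy_name        = "het-ratio",
        .policy_description = "heterogeneity-ratio-based mapping (sketch)",
    };

    int main(void)
    {
        struct starpu_conf conf;
        starpu_conf_init(&conf);
        conf.sched_policy = &my_policy;   /* register the custom scheduler */
        if (starpu_init(&conf) != 0) return 1;
        /* ... submit tasks as usual; they now flow through my_policy ... */
        starpu_shutdown();
        return 0;
    }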

Evaluation

In this section, we present our evaluation methodology and the experimental results used to assess the effectiveness of our proposed algorithms.

Related work

Scheduling algorithms for heterogeneous systems. The general problem of scheduling in heterogeneous systems has received much attention. A number of scheduling heuristics have been proposed for scheduling directed acyclic graph (DAG)-based applications in heterogeneous systems [24], [25], [26], [27], [28], [29]. These algorithms schedule a single DAG of tasks onto heterogeneous processing units with varying speeds to minimize the completion time. Zhao et al. [30] proposed
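For context, the upward rank that drives list heuristics such as HEFT assigns each task the length of the longest path from that task to an exit task, using execution and communication costs averaged over the processing units; tasks are then scheduled in decreasing rank order, each on the processor giving its earliest finish time. The sketch below is our illustration of that rank computation with a hypothetical DAG representation.

    #include <stddef.h>

    #define MAX_SUCC 8

    struct dag_task {
        double avg_exec;                 /* w(t): average execution cost */
        size_t nsucc;
        struct dag_task *succ[MAX_SUCC]; /* successors in the task graph */
        double comm[MAX_SUCC];           /* c(t, succ[i]): average transfer cost */
        double rank;                     /* memoized upward rank */
        int    has_rank;
    };

    /* rank_u(t) = w(t) + max over successors s of ( c(t,s) + rank_u(s) );
     * exit tasks simply get rank_u = w(t). */
    static double upward_rank(struct dag_task *t)
    {
        if (t->has_rank)
            return t->rank;
        double best = 0.0;
        for (size_t i = 0; i < t->nsucc; i++) {
            double r = t->comm[i] + upward_rank(t->succ[i]);
            if (r > best)
                best = r;
        }
        t->rank = t->avg_exec + best;
        t->has_rank = 1;
        return t->rank;
    }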

Conclusion

In this paper, we investigate the problem of mapping multiple applications implemented as task graphs onto a heterogeneous system consisting of CPUs and GPUs. To achieve fast completion time, we present a fine-grained mapping framework that explores a set of critical factors suggested by several measurement-based case studies. We present a theoretical framework that formulates this problem as an integer program, along with a set of practically efficient mapping algorithms. We implement the

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (Grant No. 61902169), Shenzhen Peacock Plan, China (Grant No. KQTD2016112514355531), and Science and Technology Innovation Committee Foundation of Shenzhen, China (Grant No. JCYJ20170817110848086). This work is also partially supported by the Climbing Project of China under Grant No. pdjh2019c438. We would like to thank Yiwei Cheng for his help on collecting the trace data and Mingyuan Wu for his help on

References (82)

  • The OpenCL Language (2011)
  • C. Luk, S. Hong, H. Kim, Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping, in: ...
  • C. Augonnet, et al., StarPU-MPI: Task programming over clusters of machines enhanced with accelerators
  • U. Dastgeer, C. Kessler, S. Thibault, et al., Flexible runtime support for efficient skeleton programming on hybrid ...
  • A. Bhatele, et al., Application-specific topology-aware mapping for three dimensional topologies
  • J. Enmyren, et al., SkePU: A multi-backend skeleton programming library for multi-GPU systems
  • D. Grewe, et al., A static task partitioning approach for heterogeneous systems using OpenCL
  • S. Kato, et al., TimeGraph: GPU scheduling for real-time multi-tasking environments
  • H. Zhou, et al., Task mapping in heterogeneous embedded systems for fast completion time
  • C. Augonnet, et al., StarPU: A unified platform for task scheduling on heterogeneous multicore architectures, Concurr. Comput.: Pract. Exper. (2011)
  • G. Elliott, J.H. Anderson, Real-world constraints of GPUs in real-time systems, in: Proceedings of the First ...
  • C. Basaran, et al., Supporting preemptive task executions and memory copies in GPGPUs
  • J. Lee, et al., Transparent CPU-GPU collaboration for data-parallel kernels on heterogeneous systems
  • How to optimize performance with StarPU (2008)
  • E.T. Grochowski, et al., Generational Thread Scheduler Using Reservations for Fair Scheduling (2016)
  • StarPU-dmdar (2020)
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with ...
  • R. Girshick, Fast R-CNN
  • S. Ren, et al., Faster R-CNN: Towards real-time object detection with region proposal networks
  • R. Al-Jawfi, Handwriting Arabic character recognition LeNet using neural network, Int. Arab J. Inf. Technol. (2009)
  • P. Ballester, R.M. Araujo, On the performance of GoogLeNet and AlexNet applied to sketches, in: Thirtieth AAAI ...
  • J. Deng, et al., ImageNet: A large-scale hierarchical image database
  • H. Topcuoglu, et al., Performance-effective and low-complexity task scheduling for heterogeneous computing, IEEE Trans. Parallel Distrib. Syst. (2002)
  • L.F. Bittencourt, et al., DAG scheduling using a lookahead variant of the heterogeneous earliest finish time algorithm
  • H. Zhao, et al., An experimental investigation into the rank function of the heterogeneous earliest finish time scheduling algorithm
  • H. Arabnejad, et al., List scheduling algorithm for heterogeneous systems by an optimistic cost table, IEEE Trans. Parallel Distrib. Syst. (2014)
  • R. Sakellariou, et al., A hybrid heuristic for DAG scheduling on heterogeneous systems
  • L.-C. Canon, et al., Comparative evaluation of the robustness of DAG scheduling heuristics
  • H. Zhao, et al., Scheduling multiple DAGs onto heterogeneous systems
  • G.A. Elliott, et al., Exploring the multitude of real-time multi-GPU configurations
  • J. Hua, et al., EdSketch: Execution-driven sketching for Java, STTT (2019)

Zexin Li received the B.S. degree from the Southern University of Science and Technology, Shenzhen, China, in 2020. He is currently a Ph.D. student at the University of Texas at Dallas. His research interests focus on real-time and embedded systems.

Yuqun Zhang received the B.S. degree from Tianjin University, Tianjin, China, the M.S. degree from the University of Rochester, Rochester, NY, USA, and the Ph.D. degree from the University of Texas at Austin, Austin, TX, USA. He is now an Assistant Professor with the Southern University of Science and Technology, Shenzhen, China. His research interests include software engineering and services computing.

Ao Ding received the B.S. degree from the Southern University of Science and Technology, Shenzhen, China, in 2020. He is currently an M.S. student at the Southern University of Science and Technology, Shenzhen, China. His research interests include software engineering and computer vision.

Husheng Zhou received the Ph.D. degree in computer science from the University of Texas at Dallas in 2018. He is currently working at VMware, Austin, Texas, as a member of technical staff. His research interests include real-time systems, autonomous embedded systems and computer security.

Cong Liu received the Ph.D. degree in computer science from the University of North Carolina at Chapel Hill in July 2013. He is an associate professor in the Department of Computer Science at the University of Texas at Dallas. His research interests include real-time systems and GPGPU. He has published more than 30 papers in premier conferences and journals. He received the Best Paper Award at the 30th IEEE RTSS and the 17th RTCSA. He is a member of the IEEE.
