Parallel Computing

Volume 39, Issue 12, December 2013, Pages 867-878

Improving application behavior on heterogeneous manycore systems through kernel mapping

https://doi.org/10.1016/j.parco.2013.08.011

Highlights

  • Off-line profiling analysis to extract kernel characteristics of applications.

  • An adaptive greedy algorithm (GA) to select the suitable device for a kernel.

  • An improved version of the greedy algorithm (IA).

  • A Mixed-Integer Programming (MIP) implementation to determine optimal mapping.

Abstract

Many-core accelerators are increasingly being deployed to improve system processing capabilities. In such systems, application mapping must be enhanced to maximize the utilization of the underlying architecture. In particular, on graphics processing units (GPUs), mapping the kernels that make up multi-kernel applications has a great impact on overall performance, since kernels may exhibit different characteristics on different CPUs and GPUs. While some kernels run faster on GPUs, others may perform better on CPUs. Thus, heterogeneous execution may yield better performance than executing the application only on a CPU or only on a GPU. In this paper, we investigate two approaches: a novel profiling-based adaptive kernel mapping algorithm that assigns each kernel of an application to the proper device, and a Mixed-Integer Programming (MIP) implementation that determines the optimal mapping. We utilize profiling information for kernels on different devices and generate a map that identifies which kernel should run where in order to improve the overall performance of an application. Initial experiments show that our approach can efficiently map kernels to CPUs and GPUs, and outperforms CPU-only and GPU-only approaches.

Introduction

Today’s high performance and parallel computing systems consist of different types of accelerators, such as Application-Specific Integrated Circuits (ASICs) [1], Field Programmable Gate Arrays (FPGAs) [2], Graphics Processing Units (GPUs) [3], and Accelerated Processing Units (APUs) [4]. In addition to the variety of accelerators in these systems, the applications running on them also have different processing, memory, communication, and storage requirements; even a single application may exhibit varying requirements throughout its execution. Thus, leveraging the available computational power and tailoring resource usage to the application’s execution characteristics is immensely important to maximize both application performance and resource utilization.

Applications running on heterogeneous platforms are usually composed of multiple exclusive regions known as kernels. Efficient mapping of these kernels onto the available computing resources is challenging due to the variation in their characteristics and requirements; for example, each kernel has a different execution time and memory performance on different devices. Our goal is to build a kernel mapping system that takes the characteristics of each kernel and their dependencies into account, leading to improved performance.

In this paper, we propose a novel adaptive profiling-based kernel mapping algorithm for multi-kernel applications running on heterogeneous platforms. Specifically, we run and analyze each application on every device (CPUs and GPUs) in the system to collect the necessary information, including kernel execution times and input/output data transfer times. We then pass this information to a solver to determine the mapping of each kernel on the heterogeneous system (a minimal sketch of such profiling records follows the contribution list below). We use three solvers: a greedy algorithm (GA) based solver, an improved version of the same algorithm (IA), and a Mixed-Integer Programming (MIP) based solver. Our specific contributions are:

  • an off-line profiling analysis to extract kernel characteristics of applications.

  • an adaptive greedy algorithm (GA) to select the suitable device for a kernel considering its execution time and data requirements.

  • an improved version of the greedy algorithm (IA) to avoid getting stuck in local minima.

  • a Mixed-Integer Programming (MIP) implementation to determine optimal mapping and to compare it with the greedy approach.
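
As an illustration of the profiling information that feeds these solvers, the following minimal Python sketch shows one possible record layout; the type and field names (KernelProfile, exec_time, h2d_time, d2h_time) are illustrative only and do not reflect a prescribed format.

```python
from dataclasses import dataclass

@dataclass
class KernelProfile:
    """One off-line profiling record for a (kernel, device) pair.

    Field names are illustrative; no particular layout is prescribed.
    """
    kernel: str        # kernel identifier within the application
    device: str        # e.g. "cpu" or "gpu"
    exec_time: float   # measured kernel execution time (seconds)
    h2d_time: float    # host-to-device transfer time for the kernel's inputs
    d2h_time: float    # device-to-host transfer time for the kernel's outputs

# A solver receives one such record per (kernel, device) pair and produces a
# mapping kernel -> device that minimizes the estimated overall execution cost.
```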

The initial results show that our approach increases the performance of an application considerably compared to a CPU-only or GPU-only approach. Furthermore, in many cases, the mappings we generate are equivalent, or very close, to those produced by the MIP implementation. Although our initial experiments are limited to a single type of CPU and GPU, it is possible to extend this work to support multiple CPUs, GPUs, and other types of accelerators.

The remainder of this paper is organized as follows. Related work on general-purpose GPU computing (GPGPU) is given in Section 2. The problem definition and an introduction to the proposed approach are given in Section 3. The details of the greedy algorithm (GA) and the implementation are given in Section 4. The MIP formulation is introduced in Section 5. The experimental evaluations are presented in Section 6. Finally, we conclude the paper in Section 7.

Section snippets

Related work

OpenCL is an open standard for parallel programming, targeting heterogeneous systems [5]. It began as an open alternative to Brook [6], IBM CELL [7], AMD Stream [8], and CUDA [9]. It provides a standard API that can be employed on many architectures regardless of their characteristics, and therefore has become widely accepted and supported by major vendors. In this work, we also used OpenCL and evaluated our approach on an OpenCL version of the NAS benchmarks [10].

Recent advancement in chip

Preliminaries

A major challenge on a heterogeneous platform is fully utilizing the available resources while obtaining the highest possible application performance. This issue arises from the nature of such systems, which comprise computing devices with different characteristics and capabilities. Therefore, the main goal of this work is to utilize these devices by capturing task-specific characteristics and making assignment decisions so that each task runs on the device on which it performs best.

Greedy mapping algorithm

In this section, we introduce the greedy algorithm (GA) that generates a kernel mapping minimizing the overall execution cost. Due to the complex data dependencies among kernels, minimizing the execution cost of each kernel individually may not minimize the overall execution cost of the application. Therefore, the base algorithm may not yield the best results.

In the base algorithm, we try to minimize the execution time of each kernel by selecting the device that will run the given kernel faster. Eventually, we aim
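
The following minimal Python sketch illustrates this base greedy selection, assuming per-kernel execution times and a data-transfer cost estimate derived from the off-line profiles; it illustrates the general idea rather than the exact implementation.

```python
def greedy_map(kernels, devices, exec_time, transfer_cost):
    """Base greedy mapping: pick the locally cheapest device for each kernel.

    Assumed inputs (built from the off-line profiling records):
      exec_time[(k, d)]         profiled execution time of kernel k on device d
      transfer_cost(k, d1, d2)  cost of moving kernel k's input data from d1 to d2
    """
    mapping = {}
    prev = None  # device that produced the data consumed by the current kernel
    for k in kernels:  # kernels visited in application execution order
        best_dev, best_cost = None, float("inf")
        for d in devices:
            cost = exec_time[(k, d)]
            if prev is not None and prev != d:
                # penalty for moving data between devices
                cost += transfer_cost(k, prev, d)
            if cost < best_cost:
                best_dev, best_cost = d, cost
        mapping[k] = best_dev
        prev = best_dev
    return mapping
```

Because each choice is made locally, such a scheme can get stuck in a local minimum; the improved algorithm (IA) revisits these per-kernel decisions to escape such minima and is not shown in this sketch.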

Mixed-Integer Programming

In this section, we present a MIP formulation of the kernel mapping problem in order to find the optimal mapping and compare it with the mapping generated by the greedy algorithm (GA) presented in the previous section. It is not intended as a separate mapping strategy; instead, the results it produces are used as a guideline for our greedy algorithm.

Mixed-Integer Programming provides a set of techniques that solve optimization problems with a linear objective function and linear constraints [20]. We
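
One plausible assignment-style formulation, consistent with the description above but not necessarily identical to the model developed in this section, is the following: let the binary variable $x_{k,d}$ indicate that kernel $k$ runs on device $d$, let $t_{k,d}$ be its profiled execution time there, let $E$ be the set of producer-consumer kernel pairs, and let $c_{j,k,d',d}$ be the cost of transferring kernel $k$'s input from device $d'$ (where its producer $j$ ran) to device $d$, with $c_{j,k,d,d}=0$.

```latex
\begin{align}
\min \quad & \sum_{k}\sum_{d} t_{k,d}\, x_{k,d}
             + \sum_{(j,k)\in E}\sum_{d'}\sum_{d} c_{j,k,d',d}\, y_{j,k,d',d} \\
\text{s.t.} \quad
& \sum_{d} x_{k,d} = 1 && \forall k \\
& y_{j,k,d',d} \ge x_{j,d'} + x_{k,d} - 1 && \forall (j,k)\in E,\ \forall d',d \\
& x_{k,d} \in \{0,1\}, \qquad y_{j,k,d',d} \ge 0
\end{align}
```

The auxiliary variable $y_{j,k,d',d}$ linearizes the product $x_{j,d'}\,x_{k,d}$: with nonnegative transfer costs and a minimization objective, at an optimum it equals 1 exactly when kernel $j$ is mapped to $d'$ and kernel $k$ to $d$, so the corresponding transfer cost is charged only for the placement actually chosen.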

Setup

We profiled each benchmark on a heterogeneous system consisting of a six-core AMD CPU and an NVIDIA GeForce GTX 460 GPU. Table 2 shows the details of our experimental system setup.

We assessed our algorithms on OpenCL versions of the NAS parallel benchmarks [24]. Details of the benchmarks used in the experiments are given in Table 3. Each benchmark has different characteristics; some have over 50 kernels while others have only a few. Each benchmark differs from the others in terms of

Conclusion and future work

Efficient kernel mapping for multi-kernel applications on heterogeneous platforms is important to exploit the provided computational resources and to obtain higher performance. In this paper, we introduce an efficient mapping algorithm for multi-kernel applications. We first employ a greedy approach to select the most suitable device for a specific kernel by using profiling information; then we enhance it to avoid getting stuck in local minima. Our initial experiments show that our approach

Acknowledgments

The authors thank Prof. Kagan Gokbayrak of Bilkent University for his much-appreciated help in developing the MIP formulation, and also for the solver environment. This research is supported in part by TUBITAK grant 112E360, by a grant from NVidia, and by a grant from Turk Telekom under Grant Number 3015-04.

References (26)

  • R. Gupta et al.

    Hardware-software cosynthesis for digital systems

    IEEE Design & Test of Computers

    (1993)
  • J. Rose et al.

    Architecture of field-programmable gate arrays

    Proceedings of the IEEE

    (1993)
  • C.J. Thompson et al.

    Using modern graphics architectures for general-purpose computing: a framework and analysis

  • M. Daga, A. Aji, W.-c. Feng, On the efficacy of a fused CPU+GPU processor (or APU) for parallel computing, in: 2011...
  • Khronos Group, OpenCL – the open standard for parallel programming of heterogeneous systems. [Online]. Available:...
  • I. Buck et al.

    Brook for GPUs: stream computing on graphics hardware

    ACM Transactions on Graphics

    (2004)
  • IBM, CELL. [Online]. Available:...
  • AMD, Accelerated parallel programming SDK. [Online]. Available:...
  • NVIDIA, CUDA. [Online]. Available:...
  • D. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi, P. Frederickson, T. Lasinski, R....
  • C.-K. Luk et al.

    Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping

  • D. Grewe et al.

    A static task partitioning approach for heterogeneous systems using OpenCL

  • T. Scogland, B. Rountree, W.-c. Feng, B. De Supinski, Heterogeneous task scheduling for accelerated OpenMP, 2012...

A preliminary version of this paper appears in the P2S2 2012 workshop proceedings. This submission extends our previous work by presenting a mixed integer programming based mapping implementation and an extensive experimental evaluation.
