
1 Introduction and Motivation

Heterogeneous systems that include multi-core processors, GPUs (Graphics Processing Units) and other accelerators are becoming mainstream as a way to keep improving the performance of parallel applications. However, partitioning the work among the processing units (PUs) is challenging due to their asymmetry, resource contention, the low bandwidth of the PCIe bus, etc. This work describes an efficient data-partitioning model tailored for data-parallel applications such as linear algebra kernels and the image processing applications commonly found in the bioinformatics domain.

Finding the optimal workload partitioning among PUs depends on the characteristics of the system, the size of the involved data, and the data-parallel application itself. Previous works make the partitioning decision based on information obtained from an online or offline profiling of the application. Most strategies target integrated CPU-GPU chips or systems with a single discrete GPU, and assume that the execution time is a linear function of the problem size, \(T(v)=a v+b\) [8], or even simpler, \(T(v)=a v\) with \(b=0\) [3, 5, 10, 16]. Our main observation is that fitting a separate time function for each PU provides better results.

In this work, we provide a more accurate workload-balancing model that (i) finds the optimal partition on systems with multiple accelerators, (ii) is able to detect when a device is not worth including in the computation, and (iii) is simple to use on any hybrid system. The automatic mapping model consists of two stages: a learning stage and an execution stage. In the learning stage, a light off-line training is carefully crafted to capture the interrelation between the kernel and each of the dissimilar devices; then the linear approximation of the execution time on each device is calculated, and finally the resulting system of equations is solved to find the near-optimal distribution. In the execution stage, the data-parallel code is executed with the calculated optimal partitioning.

This paper is organized as follows. Related work is reviewed in Sect. 2 and preliminary considerations are presented in Sect. 3. The description of the proposed mapping model is given in Sect. 4. Its experimental verification on different heterogeneous platforms is shown in Sect. 5, and Sect. 6 concludes.

2 Related Work

Previous work can be divided into two broad categories: on-line and off-line.

On-line training based strategies [1, 2]: The partitioning decision is made at runtime. These techniques are used when (i) the cost per parallel-for iteration strongly depends on the characteristics of the input data and/or (ii) the cost of communication between the different devices is either cheap or unnecessary. These works are mainly limited by the problem size because they normally store all the data on every device and distribute only the computation.

Off-line training based strategies [3, 7, 8, 9, 10, 16]: They partition the data based on information from previous real runs with different input sizes. These techniques provide optimal results for parallel-for loops with a stable cost per iteration and when data reallocation is avoided. In general, these works assume that the computing power of the devices is similar and use a homogeneous model, \(T(v) = a v\) (where T represents time, a is a constant, and v is the work size), tested on CPU-only or CPUs plus single-GPU systems [3, 9, 10].

Luk et al. build a similar model, \(T(v)=av+b\), and calculate the optimal partition as the point where the CPU and GPU fitting curves cross [8]. More complex models use heuristics and tree structures to calculate the optimal partition [7]. The work most closely related to ours is [16], which proposes a data-partitioning strategy for multi-GPU multi-core platforms based on the assumption \(T(v)=a v\); however, this approach neither considers the fixed cost of using a non-host processing unit nor is able to deactivate slow devices when they increase the total execution time. Besides, their simulation framework only includes processing units with similar computing power.

3 Preliminary Considerations

Algorithm 1.1. A data-parallel application: a temporal loop iterating over a spatial parallel-for.

Most data-parallel applications can be expressed as a temporal loop that iterates over a spatial parallel-for, as shown in Algorithm 1.1. The number of operations and memory accesses performed in each individual parallel-for iteration is similar across iterations, so one iteration can be considered the smallest unit of work. The performance model used for these applications, which expresses the execution time T as a function of the problem size v, \(T(v)=a_i v+b_i\), was proven to be accurate on an individual processing unit i [16] (see the sketch after this list), where:

  • v is the size of the input block of data.

  • \(a_i\) is the computational cost per unit of work on PU i (the coefficient of the independent variable v).

  • \(b_i\) is the fixed initialization cost that must be paid on each PU i.
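As an illustration (not taken from the paper's code; the kernel body and helper names are ours), the following minimal C sketch shows the loop structure of Algorithm 1.1 together with the per-PU cost model described above:

```c
#include <stddef.h>

/* Sketch of Algorithm 1.1: a temporal loop iterating over a spatial
   parallel-for whose iterations all perform a similar amount of work. */
void data_parallel_app(float *data, size_t v, int timesteps)
{
    for (int t = 0; t < timesteps; t++) {        /* temporal loop         */
        #pragma omp parallel for                  /* spatial parallel-for  */
        for (size_t i = 0; i < v; i++) {
            data[i] = 2.0f * data[i] + 1.0f;      /* one unit of work      */
        }
    }
}

/* Per-PU cost model assumed throughout: T_i(v) = a_i * v + b_i, where
   b_i is the fixed activation cost of PU i.                             */
static inline double predicted_time(double a_i, double b_i, size_t v)
{
    return a_i * (double)v + b_i;
}
```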

On multi-GPU multi-core systems the speed of one device may depend on the load of the others due to resource contention; therefore, the devices cannot be considered independent and their execution times cannot be measured separately. This work only considers parallel applications that allow a flexible distribution of the data and include an optimized kernel for each device architecture.

We model multi-GPU multi-core platforms as a set of abstract PUs. A group of processing elements that executes the same kernel of the application is represented in the model by one abstract PU. For example, if a single-threaded kernel is used, each CPU core executing this kernel is represented as one abstract PU. If a multi-threaded kernel is executed on a group of CPU cores, these cores are grouped into one PU. A GPU is usually controlled by a host process that executes on a dedicated CPU core; thus, the GPU together with its host CPU core is considered one PU.
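As a purely illustrative sketch (the names and the example grouping are our assumptions, not part of the paper), an abstract PU could be described by a small record such as:

```c
/* Hypothetical description of abstract PUs: each entry groups the physical
   resources that execute one instance of the kernel.                       */
typedef struct {
    const char *label;      /* name used when reporting the partition       */
    int         cpu_cores;  /* CPU cores grouped into this abstract PU      */
    int         has_gpu;    /* 1 if the PU wraps a GPU plus its host core   */
} abstract_pu;

/* Example: a multi-threaded CPU kernel on three cores forms one PU, while
   a GPU together with its dedicated host core forms another.               */
static const abstract_pu example_pus[] = {
    { "cpu-multithreaded", 3, 0 },
    { "gpu0+host",         1, 1 },
};
```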

4 Our Data-Partitioning Model

The main goal of this work is to find the near-optimal partition on a heterogeneous system. Hence, given a data-parallel application, a problem size and a hardware configuration, we aim at predicting the optimal workload to be assigned to each PU. To ease the discussion, we reformulate this mapping problem as finding the optimal chunk sizes \(v_1, v_2,\ldots, v_n\) to be assigned to each of the n PUs.

Fig. 1. Flow chart of the automatic data-partitioning.

We carefully design an empirical approach that predicts the execution time, T(v), on each PU in terms of the chunk size v using a linear approximation. Then, we calculate the optimal chunk sizes to be assigned to each PU based on this model. A schematic illustration of the overall approach is depicted in Fig. 1.

4.1 The Learning Stage

It is well known that parallel-for loops with little or no load imbalance are well suited for both multi-cores and accelerators. Therefore, we first need to accurately determine the linear function of the execution time on each asymmetric processing unit: the more accurate the profiling, the more precise the partition. We use off-line profiling, measuring the time for multiple input sizes. To guarantee the reliability of these experiments, the measurements are repeated until the mean values lie within a confidence interval at a confidence level \(\ge 95\%\).

Let us assume an initial input \(V_{total}\) that does not fit in the local memories of some processing units. On each asymmetric PU i we consider a maximum block size \(V^i_{fit}\) (a fraction of \(V_{total}\)) that fits in the local memory of that PU, and we measure the time it takes to process at most 10 block sizes, \(\alpha V^i_{fit}\) with \(\alpha =10\%, 20\%,\ldots, 100\%\). We start by measuring the runtime of the smallest block, of size \(0.1\times V^i_{fit}\), and gradually measure the processing time of the remaining chunk sizes. On accelerators, the ideal starting block size is a multiple of the number of cores inside the accelerator. For the applications used in the analysis and verification of the model, we found that three measurements using three problem sizes are enough for accurate load balancing.
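A minimal C sketch of this measurement loop is given below; the device-specific launcher run_kernel_on_pu() is an assumption of the sketch and stands for the CUDA or multi-threaded CPU kernel invocation:

```c
#include <stddef.h>
#include <time.h>

/* Assumed device-specific kernel launch for PU `pu` over `v` units of work. */
extern void run_kernel_on_pu(int pu, size_t v);

/* Time `n_points` block sizes up to v_fit on one PU, repeating each
   measurement `repeats` times and storing the mean.                         */
void profile_pu(int pu, size_t v_fit, int n_points, int repeats,
                size_t sizes[], double times[])
{
    for (int p = 0; p < n_points; p++) {
        size_t v = v_fit * (size_t)(p + 1) / (size_t)n_points;  /* fractions of V_fit */
        double sum = 0.0;
        for (int r = 0; r < repeats; r++) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            run_kernel_on_pu(pu, v);
            clock_gettime(CLOCK_MONOTONIC, &t1);
            sum += (double)(t1.tv_sec - t0.tv_sec)
                 + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
        }
        sizes[p] = v;
        times[p] = sum / repeats;                /* mean over the repetitions */
    }
}
```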

In order to make this stage as short as possible, we profile only the first iterations of the serial (temporal) loop, which generally represent less than \(5\%\) of the total number of iterations. This learning process is performed only once per dissimilar PU. The resulting models of the individual PUs can be stored in a database so that they can be reused whenever needed.
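From these (size, time) pairs, the per-PU parameters \(a_i\) and \(b_i\) that get stored can be obtained with a standard closed-form least-squares fit (see Sect. 4.2); a minimal sketch, with function names of our choosing, is:

```c
/* Ordinary least-squares fit of T(v) = a*v + b from n (size, time) samples.
   Requires at least two distinct sizes.                                     */
void fit_linear_model(const double v[], const double t[], int n,
                      double *a, double *b)
{
    double sv = 0.0, st = 0.0, svv = 0.0, svt = 0.0;
    for (int k = 0; k < n; k++) {
        sv  += v[k];
        st  += t[k];
        svv += v[k] * v[k];
        svt += v[k] * t[k];
    }
    double denom = (double)n * svv - sv * sv;   /* non-zero for >= 2 distinct sizes */
    *a = ((double)n * svt - sv * st) / denom;   /* cost per unit of work            */
    *b = (st - *a * sv) / (double)n;            /* fixed activation cost            */
}
```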

4.2 Model for Finding the Optimal Mapping

The time function \(T(v)=a_iv+b_i\) for each PU \(i=1,\ldots,n\) is estimated using a least-squares approximation. The desired balanced scenario occurs when all n involved devices finish the computation at the same time T. Let the assignment \(v_1,\ldots ,v_n\) refer to the block sizes that reach an execution time \(\approx T\) on the PUs. The optimization problem that finds the optimal assignment has the following ingredients.

  • \(a_i,b_i\): Parameters of the linear model, computed once per kernel and per dissimilar PU, and stored in a database. Calculated internally by the model.

  • \(V_{total}\): Total volume to be mapped. Input of the model.

  • \(v_i \ge 0\): Unknown variables describing the optimal volume to be assigned to PU i. Output of the model.

  • T: Execution time on the Heterogeneous System (HS) for the optimal mapping.

We intend to minimize the total runtime on the heterogeneous platform, \(\min T\), subject to the assumed relations:

$$\begin{aligned} {\left\{ \begin{array}{ll} T(v_i)=a_iv_i+b_i &{} \text{ if } v_i>0 \\ T(v_i)=T(v_j) \equiv T &{} i, j \in \{1,\ldots,n\}\\ V_{total}=\sum \nolimits _{i} v_i &{} i \in \{1,\ldots,n\} \end{array}\right. } \end{aligned}$$
(1)

Equation (1) defines a linear system of equations with \(n+1\) unknowns and is the key to determining the optimal execution time T and the corresponding mapping \(v_1,\ldots ,v_n\). T represents the optimal execution time on the heterogeneous system when the optimal mapping is used, and it can be estimated from the parameters \(a_i\) and \(b_i\) as follows:

$$\begin{aligned} {\left\{ \begin{array}{ll} T \equiv T(V_{total})= a_H V_{total}+b_H &{} \text{ where }\\ a_H=\left(\sum \nolimits _i \frac{1}{a_i}\right)^{-1}\\ b_H=a_H\sum \nolimits _i \frac{b_i}{a_i}\\ v_i=\frac{T-b_i}{a_i} &{} i \in \{1,\ldots,n\} \end{array}\right. }\end{aligned}$$
(2)
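For illustration only (the coefficients are arbitrary and not taken from the experiments), consider two PUs with \(a_1=2, b_1=1\), \(a_2=1, b_2=3\) and \(V_{total}=12\) units of work. Equation (2) then gives

$$\begin{aligned} a_H=\left(\tfrac{1}{2}+\tfrac{1}{1}\right)^{-1}=\tfrac{2}{3},\qquad b_H=\tfrac{2}{3}\left(\tfrac{1}{2}+\tfrac{3}{1}\right)=\tfrac{7}{3},\qquad T=\tfrac{2}{3}\cdot 12+\tfrac{7}{3}\approx 10.33, \end{aligned}$$

$$\begin{aligned} v_1=\frac{T-b_1}{a_1}\approx 4.67,\qquad v_2=\frac{T-b_2}{a_2}\approx 7.33,\qquad v_1+v_2=12=V_{total}, \end{aligned}$$

so both PUs finish at approximately the same time T, as intended.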

Our aim is to determine the subset Q of \(\{1,\ldots,n\}\) of active PUs and the distribution \(v_i>0\) within this set. The parameter \(b_i\) is key to detecting PUs that are too slow: it represents the cost that must be paid for activating PU i. If the computation time that would be assigned to a PU is lower than its fixed cost \(b_i\), i.e., \(T-b_i<0\), the PU is labeled as inactive by setting \(v_i=0\). Besides, if the assigned volume \(v_i\) is smaller than the volume of one unit of work, PU i is also deactivated. More deactivation criteria can be included in our model according to the objectives of the implementation, for example criteria for energy optimization.

Algorithm 1. Procedure that finds the optimal data-partitioning according to the deactivation criteria.

The optimal mapping is then computed only for the set of active PUs. The procedure that finds the optimal data-partitioning according to the deactivation criteria is given in Algorithm 1.
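The following C sketch reflects our reading of this procedure, not the authors' exact code: solve Eq. (2) over the currently active PUs, deactivate those that violate the criteria above, and repeat until the active set stabilizes.

```c
/* Sketch of the partitioning procedure: a[], b[] hold the per-PU model
   parameters and v[] receives the chunk assigned to each PU (0 if inactive).
   MAX_PUS simply bounds the number of PUs in this sketch.                   */
#define MAX_PUS 64

void partition(const double a[], const double b[], int n,
               double v_total, double unit_of_work, double v[])
{
    int active[MAX_PUS];
    for (int i = 0; i < n; i++) active[i] = 1;

    int changed = 1;
    while (changed) {
        changed = 0;

        /* Eq. (2) restricted to the currently active PUs. */
        double inv_a = 0.0, boa = 0.0;
        for (int i = 0; i < n; i++)
            if (active[i]) { inv_a += 1.0 / a[i]; boa += b[i] / a[i]; }
        if (inv_a == 0.0) return;             /* no PU left: nothing to map */
        double a_H = 1.0 / inv_a;
        double T   = a_H * v_total + a_H * boa;

        for (int i = 0; i < n; i++) {
            if (!active[i]) { v[i] = 0.0; continue; }
            v[i] = (T - b[i]) / a[i];
            /* Deactivation criteria: activation cost exceeds T, or the
               assigned chunk is smaller than one unit of work.             */
            if (T - b[i] < 0.0 || v[i] < unit_of_work) {
                active[i] = 0;
                v[i]      = 0.0;
                changed   = 1;                /* recompute with the rest    */
            }
        }
    }
}
```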

5 Experiments and Evaluation

This section provides the analysis and validation of our mapping model using multiple data sizes and multiple system configurations. The heterogeneous system used in this study includes the GPUs and the multi-core CPU described in Table 1. We consider three platforms based on different combinations of these PUs, as follows.

  • Platform 1: One CPU core of the Intel Quad Core CPU Q9450 (labeled as cpu0) and two GPUs, GeForce GTX 480 (labeled as GPU0) and Tesla C2070 (labeled as GPU1).

  • Platform 2: Three GPUs, GeForce GTX 480 (GPU0), Tesla C2070 (GPU1) and Tesla S2050 (GPU2).

  • Platform 3: Two CPU cores of the Intel Quad Core CPU Q9450 (cpu1) and two GPU devices, GeForce GTX 480 (GPU0) and Tesla C2070 (GPU1).

Table 1. Characteristics of the GPUs and the multi-core CPU included in the heterogeneous system used in this study.

For testing our approach, we consider the data-parallel applications described in Table 2: two widely used linear algebra kernels, the vector operation saxpy and the sparse matrix-vector product SpMV [13, 15], and the Anisotropic Nonlinear Diffusion (AND) method, a complex application for de-noising 3D images in bioinformatics and structural biology [11, 12]. AND iterates over a noisy 3D image, preserving its edges, until it is completely filtered; this iterative structure allows integrating the learning strategy proposed in this work. For the experiments, we used the NVIDIA CUDA (version 6.5) and mpicc compilers with the −O2 optimization option. Our parallel implementation creates one MPI process per target PU; each process then executes the kernel optimized for its specific PU architecture. We use the best thread-block configuration [14].

Table 2. Applications used for the validation of our mapping model.
Fig. 2. (left) Execution time of saxpy and (right) workload distribution on each PU of Platforms 1, 2 and 3, using five data volumes.

The proposed methodology balances the workload of the applications under test by following these steps. First, in the learning stage, the saxpy, SpMV and AND applications are evaluated using several instances of the problem size on each PU. For saxpy and SpMV, we used problem sizes of 100,000, 1,000,000 and 5,000,000 elements. Similarly, for AND we measured the execution time for three 3D-image sizes, \(dimX\times dimY\times 16\), \(dimX\times dimY\times 32\) and \(dimX \times dimY \times 64\), where \(dimX \times dimY\) is the size of one plane of the input 3D image. These measurements of the execution time were enough to build an accurate performance model for saxpy, SpMV and AND.

Afterwards, in the execution stage, the model considers several hybrid configurations, labeled Platforms 1, 2 and 3, takes into account the aforementioned measurements to obtain T for each configuration, and calculates the size of the chunk \(v_i\) that must be assigned to each PU i. Finally, the applications are launched on Platforms 1, 2 and 3 using the calculated \(v_i\) for \(i=1,\ldots,n\).

Fig. 3. (left) Execution time of SpMV and (right) workload distribution on each PU of Platforms 1, 2 and 3, using five data volumes.

Figures 2, 3 and 4 show the execution time that each PU takes to carry out the workload assigned by the model for saxpy, SpMV and AND, for several problem sizes, on Platforms 1, 2 and 3. The figures on the right-hand side show the percentage of the total workload assigned to each PU. The execution time includes both computation and communication time. \(T_{ideal}\), depicted in the figures, is calculated by the model as \(T=a_{H}V_{total}+b_H\), where \(a_H\) and \(b_H\) are the parameters of the overall hybrid system as shown in the previous section. In general, the model achieves an optimal load balance between the involved PUs for all problem sizes and on all the considered hybrid platforms. In addition, the measured execution time is very similar to \(T_{ideal}\). In particular, the model is most accurate for all the applications on Platform 2, which includes only GPUs. A slight imbalance (measured as \(\max_i(|T_{ideal}-T_i|)/T_{ideal}\), where i indexes the PUs of the considered platform) appears for SpMV on Platform 3, but it remains within a reasonable range. This is because it is harder to balance the load on highly heterogeneous platforms such as Platforms 1 and 3, where the difference between the performance of their components, i.e., the CPU and the GPUs, is very high.

Fig. 4. (left) Execution time and (right) workload distribution of the AND application on each PU of Platforms 1, 2 and 3, using three volume sizes.

The capacity of our model to deactivate PUs that are too slow can be observed in Fig. 2(a), where cpu0 is deactivated for all problem sizes, and in Fig. 4(a) and (c) for Platforms 1 and 3, where cpu0 and cpu1 are deactivated for the problem sizes \(128\times 128 \times 128\) and \(256\times 256 \times 256\). In these cases, the model chooses to deactivate cpu0 in Platform 1 or cpu1 in Platform 3 because the assigned volume is smaller than the volume of one unit of work or because the b parameter of these PUs is larger than the hybrid ideal execution time \(T_{ideal}\).

Fig. 5. Execution time that each configuration, i.e., cpu0, cpu1, GPU0, GPU1 and Platforms 1, 2 and 3, takes to carry out the overall work for the 5,000,000-element problem for (a) saxpy, (b) SpMV and (c) AND.

To compare the heterogeneity of the hybrid platforms and the involved PUs, Figs. 5(a), (b) and (c) show the response time of saxpy, SpMV and AND, using different problem sizes, on cpu0, cpu1, GPU0, GPU1 and Platforms 1, 2 and 3. As can be observed, using hybrid platforms (together with a good load-balancing strategy) always pays off with respect to individual accelerators such as GPU0, in spite of their heterogeneity and the challenge of load balancing. As we can see, the best performance is achieved on Platforms 1, 2 and 3 in all tests.

Several metrics have been defined for evaluation and comparison purposes on heterogeneous systems [4, 6]. In this analysis, the Ideal Relative Speedup is considered. It is calculated as the ratio between the run-time on the fastest device when the whole problem of size v is executed on it, and \(T_{ideal}(v)\), which represents the estimated run-time on the hybrid platform.
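In symbols, for a problem of size v this metric can be written as

$$\begin{aligned} \text{ Ideal } \text{ Relative } \text{ Speedup }(v)=\frac{\min _{i} T_i(v)}{T_{ideal}(v)}, \end{aligned}$$

where \(T_i(v)\) is the run-time when the whole problem is executed on PU i alone (so the numerator is the run-time of the fastest single device) and \(T_{ideal}(v)=a_H v + b_H\) is the estimated run-time on the hybrid platform.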

Figures 6(a), (b) and (c) show a comparison between the Ideal Relative Speedup and the measured Relative Speedup for saxpy, SpMV and AND, using the largest volume size, on Platforms 1, 2 and 3. The difference between the ideal and the measured speedup can be considered a quantification of the remaining imbalance; as can be seen, these differences are very small. Figures 6(a), (b) and (c) also illustrate an interesting quality of our data-partitioning model: its ability to estimate beforehand the combination of PUs that provides the best performance.

Fig. 6. Ideal speedup versus relative speedup for the 5,000,000-element problem for (a) saxpy, (b) SpMV and (c) AND.

6 Conclusions and Future Work

Selecting both the processing units and the amount of work each of them should perform is a challenging task. This paper proposes a work-balancing strategy that helps to overcome these two difficulties: it builds a global performance model with minimal training, which makes it possible to find a close-to-minimum execution time. Our detailed results show that the model is within less than 3% of an ideal, unrealistic balancing strategy. As future work, we plan to compare our approach with that of Zhong et al. [16] to showcase the advantages of our method.