Quantized load distribution for tree and bus-connected processors
Introduction
Data-parallelism in many application areas is a natural extension of the serial programming model, allowing for faster problem solving and easy scaling with the available processing nodes. Naturally, it has received great attention in the literature [18]. In real-life situations, the processing load can generally be expressed as an integer multiple of some quantity, which can be considered to be the load quantum (e.g. a pixel of an image to be enhanced or a database record). However, optimally distributing a quantized processing load to the nodes of a parallel machine requires either integer programming techniques [36] or settling for a suboptimal heuristic.
Divisible Load Theory (DLT) [7] manages to overcome this obstacle by assuming that the load is arbitrarily divisible. Despite the simplistic nature of this assumption, DLT is a powerful tool that not only provides linear (in most cases) time complexity solutions, but also yields closed-form expressions that describe the solutions in a tractable form [3], [6], [29]. In turn, these closed-form expressions can form a basis for analytical modeling of parallel applications/architectures [23], allowing fine-tuning of their parameters. A trade-off is that the solutions offered cannot be directly applied to real-life problems. This paper investigates several ways to overcome this shortcoming.
In particular, two algorithms are proposed that start from a DLT solution and produce a near-optimal distribution of such quantized loads. Even though simple rounding-off procedures could be used instead to produce integer loads, our motivation was to design methods that disturb the load balance achieved by DLT as little as possible. In the load shifts they make, the proposed algorithms take into account not only the DLT solution but also the particular capabilities of the target machine nodes.
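To make the "simple rounding-off" baseline concrete, the following is a minimal sketch of one such procedure, largest-remainder rounding, which keeps the total load unchanged. It is a generic technique and not one of the two algorithms proposed in this paper: it ignores node and link capabilities altogether. The function name and the sample DLT solution are hypothetical.

```python
import math

def round_preserving_total(fractional_loads):
    """Round fractional loads to integers while keeping their sum unchanged.

    Generic largest-remainder rounding (an illustrative baseline only):
    node speeds and link costs are not considered.
    """
    total = round(sum(fractional_loads))
    floors = [math.floor(x) for x in fractional_loads]
    leftover = total - sum(floors)
    # Hand the remaining quanta to the nodes with the largest fractional parts.
    order = sorted(
        range(len(fractional_loads)),
        key=lambda i: fractional_loads[i] - floors[i],
        reverse=True,
    )
    quantized = floors[:]
    for i in order[:leftover]:
        quantized[i] += 1
    return quantized

# Hypothetical divisible solution for a total load of 100 quanta.
dlt_parts = [41.7, 33.4, 24.9]
print(round_preserving_total(dlt_parts))  # [42, 33, 25]
```

Every node ends up within one quantum of its divisible share, but the extra quanta may land on slow nodes or slow links, which is the kind of imbalance a capability-aware method tries to avoid.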
These algorithms are independent of the model used for obtaining the divisible solution and thus are general enough to augment all the divisible load methods proposed in the literature to date. Although DLT methods require specific cost models in order to obtain a solution, that solution is the only prerequisite for the operation of the proposed algorithms. An additional asset is that they are suitable for both the single [3] and multi-installment [7] distribution strategies that are widely used in the divisible load scheduling literature. Multi-installment strategies reduce the delay caused by load distribution by delivering the load in parts, allowing the processors to begin computation as early as possible and thereby minimizing the overall processing time of the entire load. By combining DLT with our algorithms, we can arrive at a near-optimal solution in O(KP) time, for P processors and K installments, which compares favorably with integer programming.
In order to evaluate the performance obtained by these algorithms, the worst-case performance deterioration (execution time increase) is derived for a single installment on single-level trees and for multiple installments on bus-connected processors. These theoretical bounds are based on affine communication and computation cost models, which are the most general that have been proposed in the literature [1], [3], [9], [14], [15], [19]. In the context of DLT, cost refers to time delay. These two cases were chosen because closed-form solutions exist for their optimal processing time; currently, no closed-form solution exists for a multi-installment distribution on anything other than bus-connected processors [7]. Additionally, an indication of what to expect in realistic situations is obtained through rigorous simulation tests on randomly generated processor trees. The results show promise with respect to the problem of determining an optimum working subset of processors, i.e., a subset of processors that solves the problem in a minimum amount of time.
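As a rough illustration of what "execution time increase" means under affine costs, the sketch below evaluates the finish time of a single-installment distribution on a single-level tree. It rests on several simplifying assumptions that are not taken from the paper: the root sends to one child at a time, the root does not compute, each child starts computing only after receiving its whole part, and result collection is ignored. All names (`Child`, `makespan`, the rate and startup fields) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Child:
    link_rate: float     # assumed: communication time per load unit
    link_startup: float  # assumed: constant communication startup delay
    cpu_rate: float      # assumed: computation time per load unit
    cpu_startup: float   # assumed: constant computation startup delay

def makespan(children, parts):
    """Finish time when the root sends parts[i] to children[i] in order.

    Affine costs: each transfer and each computation takes
    rate * part + startup. Children with an empty part are skipped.
    """
    clock = 0.0   # time at which the root's link becomes free
    finish = 0.0  # latest completion time seen so far
    for child, part in zip(children, parts):
        if part == 0:
            continue
        clock += child.link_rate * part + child.link_startup
        done = clock + child.cpu_rate * part + child.cpu_startup
        finish = max(finish, done)
    return finish

# Hypothetical two-child tree and two candidate distributions of 100 quanta.
tree = [Child(0.1, 1.0, 1.0, 2.0), Child(0.2, 1.0, 2.0, 2.0)]
divisible = [66.5, 33.5]
quantized = [67, 33]
print(makespan(tree, quantized) - makespan(tree, divisible))
```

The printed difference is the penalty paid by one particular integer distribution relative to one particular divisible one; a worst-case bound would instead consider the largest such difference over all admissible inputs.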
The paper is organized as follows: In Section 2 we present the related work in this area, while in Section 3 we present the DLT models and notations that are used in this paper. The two novel quantization algorithms are presented in Section 4 and studied with respect to the maximum execution penalty they introduce in Section 5. Finally, the combined performance of the algorithms exhibited on a number of rigorous simulation tests is shown in Section 6.
Related work
The load partitioning and distribution problem in data-parallel applications (also referred to as Locally Parallel Globally Serial [27] or fine-grained) has been attacked by both static [27] and dynamic [22], [25] approaches. This problem is also of prime interest to parallel compiler and systolic architecture designers [27]. Static approaches have received much attention not only because of the occasional inefficiency of dynamic approaches to deliver adequate performance, but also because
DLT models and notations
The sections that follow are based on the communication and computation delay models presented in [3]. These models, which are summarized below, are generalizations of the models used in the literature. Suppose that there is a node X connected to the load originating node Y. X receives a portion partX of the load L. For the case of a single installment distribution and given that
- (a) there is no time gap between the various phases, i.e. no idle period between the distribution, processing and
Quantization algorithms
For analytical ease, we assume that the processing load quantum, i.e., a non-divisible load unit (the basic granule size that can be assigned to a processor), is equal to unity, and so the total load L is an integer. Given that X is an internal node of a processor tree, then for each of the nodes Z ∈ {X} ∪ {Y : parent(Y) = X}, the following notations are used:
- D is the load that should be assigned to a subtree rooted at X according to DLT (obviously, for the root of the tree D = L).
- partZ is the part of D that should be
Worst case behaviour of algorithm Quantify for a single installment strategy on a single level tree
In the case of a single-level tree, Quantify modifies the load assigned to a node X so that |QX − DX| < 1. This condition does not hold, however, for trees of depth n, as errors accumulate while Quantify progresses from the root to the leaves.
Apart from the delay models, one has to take into account the communication capabilities of the load originating processor P0. The problem is not whether P0 can communicate and compute at the same time, as this is irrelevant to the P0 schedule disturbance
Simulation results
In order to evaluate the effects of quantization on the schedules delivered by a DLT approach, we performed a number of simulations for (a) single installment query and image processing applications on random processor trees and (b) multi-installment load processing on random bus-connected machines. The above references translate to specific choices for the computational and communicational parameters, reflecting typical behaviour for the corresponding problem categories, and are not related to
Conclusions
The contributions of this paper are of significant value to the domain of DLT as well as to applications that employ the divisible load paradigm. In reality, all applications work with integer quantum load portions. Most multi-computer systems employ an equal load sharing strategy, which is best if communication delays are negligible owing to high-speed links. However, in a heterogeneous environment, communication delays are often non-negligible and add considerable
References (36)
- et al. Scheduling divisible loads with processor release times and finite size buffer capacity constraints in bus networks. Journal of Parallel and Distributed Systems (2002)
- et al. Scheduling divisible jobs on hypercubes. Parallel Computing (1995)
- et al. Distributed processing of divisible jobs with communication startup costs. Discrete Applied Mathematics (1997)
- et al. Scheduling a divisible task in a two-dimensional toroidal mesh. Discrete Applied Mathematics (1999)
- et al. Distributed computation with communication delays: asymptotic performance analysis. Journal of Parallel and Distributed Computing (1994)
- et al. Load partitioning and trade-off study for large matrix–vector computations in multicast bus networks with communication delays. Journal of Parallel and Distributed Computing (1998)
- et al. Partitioning techniques for large-grained parallelism. IEEE Transactions on Computers (1988)
- G.D. Barlas, Compression Algorithms for One-Dimensional Semiperiodical Biomedical Signals and Methods for their...
- Collection-aware optimum sequencing of operations and closed-form solutions for the distribution of a divisible load on arbitrary processor trees. IEEE Transactions on Parallel & Distributed Systems (1998)
- G.D. Barlas, B. Veeravalli, Distributed Video Servers: Delivery of continuous-media documents using unreliable...