Quantized load distribution for tree and bus-connected processors

doi:10.1016/j.parco.2004.06.002

Parallel Computing

Volume 30, Issue 7, July 2004, Pages 841-865

https://doi.org/10.1016/j.parco.2004.06.002

Abstract

Divisible load analysis is a valuable tool for generating solutions to data-partitioning and distribution/scheduling problems for data-parallel applications. This paper addresses an essential step required for applying these solutions to real-life problems, where computing loads are multiples of some fundamental, problem-specific, non-divisible load unit. The algorithms proposed to this end are suitable for both single- and multi-installment strategies. The worst-case performance of the algorithms is derived for two cases: single installment on a single-level tree and multiple installments on a bus network. Finally, an estimate of the expected performance of the algorithms is obtained from rigorous simulation tests. The extensive analysis accompanying these tests illustrates many aspects of the parallel computation associated with the parallel machine architectures and the load distribution strategies (single- or multi-installment) used.

Introduction

Data-parallelism in many application areas is a natural extension of the serial programming model, allowing for faster problem solving and easy scaling with available processing nodes. Naturally it has received great attention in the literature [18]. In real-life situations, the processing load can generally be expressed as an integer multiple of some quantity, which can be considered to be the load quantum (e.g. a pixel of an image to be enhanced or a database record). However, the problem of optimally distributing a quantized processing load to the nodes of a parallel machine would require the use of integer programming techniques [36], or settling for a suboptimal heuristic.

Divisible Load Theory (DLT) [7] manages to overcome this obstacle by assuming that the load is arbitrarily divisible. Despite the simplistic nature of this assumption, DLT is a powerful tool that not only provides linear (in most cases) time complexity solutions, but also yields closed-form expressions that describe the solutions in a tractable form [3], [6], [29]. In turn, these closed-form expressions can form a basis for analytical modeling of parallel applications/architectures [23], allowing fine-tuning of their parameters. A trade-off is that the solutions offered cannot be directly applied to real-life problems. This paper investigates several ways to overcome this shortcoming.

In particular, based on a DLT solution, two algorithms are proposed in order to obtain a near-optimal solution for the distribution of such quantized loads. Although simple rounding-off procedures could be used to produce integer loads, our motivation was to design methods that disturb the load balance achieved by DLT as little as possible. In the load shifts they make, the proposed algorithms take into account not only the DLT solution but also the particular capabilities of the target machine nodes.
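For reference, the kind of simple rounding-off procedure mentioned above can be sketched as follows. This is only an illustrative baseline (largest-remainder rounding), not either of the algorithms proposed in the paper; the function name `round_preserving_total` and the sample loads are assumptions made for the example. It ignores node capabilities entirely, which is precisely the limitation the proposed algorithms aim to address.

```python
import math


def round_preserving_total(ideal_loads):
    """Round real-valued (divisible) loads to integers that keep the same total.

    Illustrative baseline only: every load is floored, and the units that
    remain are handed to the entries with the largest fractional parts.
    Assumes the ideal loads sum to an integer total (up to rounding error).
    """
    total = round(sum(ideal_loads))
    floors = [math.floor(d) for d in ideal_loads]
    leftover = total - sum(floors)
    # Indices ordered by decreasing fractional part.
    order = sorted(
        range(len(ideal_loads)),
        key=lambda i: ideal_loads[i] - floors[i],
        reverse=True,
    )
    quantized = floors[:]
    for i in order[:leftover]:
        quantized[i] += 1
    return quantized


# Hypothetical divisible solution for three nodes and a total load of 10 units.
ideal = [3.7, 2.2, 4.1]
print(round_preserving_total(ideal))  # [4, 2, 4]
```

Each integer load produced this way differs from its divisible counterpart by less than one unit, and the total load is preserved.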

These algorithms are independent of the model used for obtaining the divisible solution and thus are general enough to augment all the divisible load methods proposed in the literature to this date. Although specific cost models are required by DLT methods for obtaining a solution, the latter is the only prerequisite for the operation of the proposed algorithms. An additional asset is that they are suitable for both single [3] and multi-installment [7] distribution strategies that are widely used in the divisible load scheduling literature. The latter are capable of minimizing the delay caused by load distribution, by delivering the load in parts and thus allowing the processors to begin computation at the shortest possible time, thereby minimizing the overall processing time of the entire load. By combining DLT with our algorithms, we can arrive at a near optimum solution in time O(K P), for P processors and K installments, which is favorable compared to using integer programming.

In order to evaluate the performance obtained by these algorithms, the worst-case performance deterioration (execution time increase) is derived for single installment on single level trees and multiple installments on bus-connected processors. These theoretical bounds are based on affine communicational and computational cost models, which are the most general that have been proposed in the literature [1], [3], [9], [14], [15], [19]. The metric cost in the context of DLT refers to time delays. The above mentioned two cases were chosen because of the existence of closed-form solutions for the optimal processing time. Currently there exists no closed-form solution for a multi-installment distribution on anything other than bus-connected processors [7]. Additionally, a feel of what is to be expected in realistic situations is obtained by rigorous simulation tests on randomly generated processor trees. The results show promise with respect to the problem of determining an optimum working subset of processors, i.e., a subset of processors that solve the problem in a minimum amount of time.
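As a reminder of what an affine cost model means in general, the delays for a load portion $x$ can be written in the following generic form. The symbols $a$, $b$, $c$ and $d$ are illustrative assumptions and are not the notation used in the paper:

$$T_{\mathrm{comm}}(x) = a + b\,x, \qquad T_{\mathrm{comp}}(x) = c + d\,x$$

Here $a$ and $c$ stand for fixed start-up delays, while $b$ and $d$ stand for per-unit communication and computation costs. Setting $a = c = 0$ recovers a purely linear model.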

The paper is organized as follows: In Section 2 we present the related work in this area, while in Section 3 we present the DLT models and notations that are used in this paper. The two novel quantization algorithms are presented in Section 4 and studied with respect to the maximum execution penalty they introduce in Section 5. Finally, the combined performance of the algorithms exhibited on a number of rigorous simulation tests is shown in Section 6.

Section snippets

Related work

The load partitioning and distribution problem in data-parallel applications (also referred to as Locally Parallel Globally Serial [27] or fine-grained) has been attacked by both static [27] and dynamic [22], [25] approaches. This problem is also of prime interest to parallel compiler and systolic architecture designers [27]. Static approaches have received much attention not only because of the occasional inefficiency of dynamic approaches to deliver adequate performance, but also because

DLT models and notations

The sections that follow are based on the communication and computation delay models presented in [3]. These models, which are summarized below, are generalizations of the models used in the literature. Suppose that there is a node X connected to the load originating node Y. X receives a portion part_X of the load L. For the case of a single installment distribution and given that

(a) there is no time gap between the various phases, i.e. no idle period between the distribution, processing and

Quantization algorithms

For analytical ease, we assume that the processing load quantum, i.e., a non-divisible load unit (the basic granule size that can be assigned to a processor), is equal to unity, so that the total load $L \in \mathbb{N}$. Given that X is an internal node of a processor tree, the following notations are used for each of the nodes Z ∈ {X} ∪ {Y : parent(Y) = X}:

• D is the load that should be assigned to a subtree rooted at X according to DLT (obviously, for the root of the tree D = L).
• part_Z is the part of D that should be

Worst case behaviour of algorithm Quantify for a single installment strategy on a single level tree

In the case of a single-level tree, Quantify modifies the load assigned to a node X so that ∣Q_X − D_X∣ < 1. This condition does not hold, however, for trees of depth n, as errors accumulate while Quantify progresses from the root to the leaves.

Apart from the delay models, one has to take into account the communication capabilities of the load originating processor P₀. The problem is not whether P₀ can communicate and compute at the same time, as this is irrelevant to the P₀ schedule disturbance

Simulation results

In order to evaluate the effects of quantization on the schedules delivered by a DLT approach, we performed a number of simulations for (a) single installment query and image processing applications on random processor trees and (b) multi-installment load processing on random bus-connected machines. The above references translate to specific choices for the computational and communicational parameters, reflecting typical behaviour for the corresponding problem categories, and are not related to

Conclusions

The contributions of this paper are of significant value to the domain of DLT as well as to applications that employ the divisible load paradigm. In reality, all applications work with integer quantum load portions. Most multi-computer systems employ an equal load sharing strategy, which is best if communication delays are negligible owing to high-speed links. However, in a heterogeneous environment, communication delays are often non-negligible and add considerable

References (36)

B. Veeravalli et al., Scheduling divisible loads with processor release times and finite size buffer capacity constraints in bus networks, Journal of Parallel and Distributed Systems (2002)
J. Blazewicz et al., Scheduling divisible jobs on hypercubes, Parallel Computing (1995)
J. Blazewicz et al., Distributed processing of divisible jobs with communication startup costs, Discrete Applied Mathematics (1997)
J. Blazewicz et al., Scheduling a divisible task in a two-dimensional toroidal mesh, Discrete Applied Mathematics (1999)
D. Ghose et al., Distributed computation with communication delays: asymptotic performance analysis, Journal of Parallel and Distributed Computing (1994)
D. Ghose et al., Load partitioning and trade-off study for large matrix–vector computations in multicast bus networks with communication delays, Journal of Parallel and Distributed Computing (1998)
R. Agrawal et al., Partitioning techniques for large-grained parallelism, IEEE Transactions on Computers (1988)
G.D. Barlas, Compression Algorithms for One-Dimensional Semiperiodical Biomedical Signals and Methods for their...
G.D. Barlas, Collection-aware optimum sequencing of operations and closed-form solutions for the distribution of a divisible load on arbitrary processor trees, IEEE Transactions on Parallel & Distributed Systems (1998)
G.D. Barlas, B. Veeravalli, Distributed Video Servers: Delivery of continuous-media documents using unreliable...
S. Bataineh et al., Closed form solutions for bus and tree networks of processors load sharing a divisible job, IEEE Transactions on Computers (1994)
B. Veeravalli et al., Optimal sequencing and arrangement in distributed single-level tree networks with communication delays, IEEE Transactions on Parallel and Distributed Systems (1994)
B. Veeravalli et al., Scheduling Divisible Loads in Parallel and Distributed Systems (1996)
B. Veeravalli et al., Access time minimization for distributed multimedia applications
B. Veeravalli et al., On the influence of start-up costs in scheduling divisible loads on bus networks, IEEE Transactions on Parallel and Distributed Systems (2000)
B. Veeravalli et al., Sub-optimal solutions using integer approximation techniques for scheduling divisible loads on distributed bus networks, IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans (2000)
B. Veeravalli et al., Efficient partitioning and scheduling of computer vision and image processing data on bus networks using divisible load analysis, Image and Vision Computing (2000)
S.K. Chan et al., Large matrix–vector products on distributed bus networks with communication delays using the divisible load paradigm: performance analysis and simulation, Mathematics and Computers in Simulation (2001)
