Task scheduling in multiprocessing systems using duplication

https://doi.org/10.1016/j.sysarc.2007.09.004

Abstract

Task scheduling continues to be one of the most challenging problems in both parallel and distributed computing environments. In this paper, we present a task scheduling algorithm, which uses duplication, to optimally schedule any application represented in the form of a directed acyclic graph (DAG). It has a time complexity of O(d∣V∣³), where ∣V∣ represents the number of tasks and d the maximum indegree of tasks.

Introduction

The main focus of task scheduling is to find an optimal schedule which minimizes the parallel execution time of an application. An application can be broken down into a set of tasks. We represent an application using a directed acyclic graph (DAG) G defined by the tuple (V, E, τ, c), where V, the nodes, represents the set of tasks and E, the edges, represents the communication between tasks. For any task ni ∈ V, τ(ni) denotes the computational cost. An edge from ni to nj, where ni, nj ∈ V, represents the precedence relationship between these two tasks. The communication cost from ni to nj is denoted by c(ni, nj). If (ni, nj) ∈ E, then ni is called the immediate predecessor of nj, and nj is called the immediate successor of ni. Given a task na, its set of immediate predecessors, denoted by ipred(na), is defined as {nj | (nj, na) ∈ E}. A task na is called a join task if it has two or more immediate predecessors. The set of immediate successors of na is denoted by isucc(na) and is defined as {nj | (na, nj) ∈ E}. D(na) is the set of all the descendants of task na, and is defined as ⋃_{nj ∈ isucc(na)} ({nj} ∪ D(nj)). It is assumed that there is one entry task nα and one exit task nω for the DAG, and that the number of processors available is ∣V∣. Task scheduling algorithms have applications in distributed as well as grid computing environments [1], [2].
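As a concrete reference, the graph-theoretic definitions above (ipred, isucc, join task, D(na)) can be sketched in Python; the task numbers and costs below are purely illustrative, not taken from the paper:

```python
# Sketch of the DAG model (V, E, tau, c): tasks are integers,
# tau maps task -> computational cost, c maps edge (ni, nj) -> communication cost.
tau = {1: 2, 2: 3, 3: 4}                 # hypothetical computational costs
c = {(1, 2): 5, (1, 3): 1, (2, 3): 2}    # hypothetical communication costs
V = set(tau)
E = set(c)

def ipred(na):
    """Immediate predecessors: {nj | (nj, na) in E}."""
    return {nj for (nj, nk) in E if nk == na}

def isucc(na):
    """Immediate successors: {nj | (na, nj) in E}."""
    return {nj for (nk, nj) in E if nk == na}

def is_join(na):
    """A join task has two or more immediate predecessors."""
    return len(ipred(na)) >= 2

def descendants(na):
    """D(na) = union over nj in isucc(na) of {nj} ∪ D(nj); terminates because G is acyclic."""
    out = set()
    for nj in isucc(na):
        out |= {nj} | descendants(nj)
    return out
```

Here task 3 is a join task (predecessors 1 and 2), and D(1) = {2, 3}.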

Task scheduling is one of the most challenging problems in both parallel and distributed computing environments. Whether duplication techniques are employed or not, scheduling is a well-known NP-complete problem [6], [7], [8]. To deal with this problem, many heuristic methods have been proposed [1], [2], [10]. Some of these methods produce an optimal schedule only when certain constraints are satisfied. For example, for task graphs with small communication delays, Colin and Chretienne [4] proposed a lower bound (LWB) algorithm that gives an optimal solution, provided that the following condition is always true for the DAG:

min_{ni ∈ ipred(na)} τ(ni) ≥ max_{ni ∈ ipred(na)} c(ni, na), for any join task na.

That is, for any join task na, the computational cost of each immediate predecessor is not less than the largest incoming communication cost with respect to na. Task duplication can reduce the completion time of a task, since there is no communication cost if the predecessors of the task are duplicated on the same processor.
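The LWB applicability condition can be checked mechanically; the following sketch (with hypothetical costs, not from the paper) tests, for every join task, whether the smallest predecessor computational cost dominates the largest incoming communication cost:

```python
def lwb_condition_holds(V, E, tau, c):
    """True iff, for every join task na,
    min tau(ni) >= max c(ni, na) over ni in ipred(na)."""
    for na in V:
        preds = [nj for (nj, nk) in E if nk == na]
        if len(preds) >= 2:  # join task
            if min(tau[ni] for ni in preds) < max(c[(ni, na)] for ni in preds):
                return False
    return True

# Example: task 3 is a join task with predecessors 1 and 2.
tau = {1: 4, 2: 3, 3: 1}
c = {(1, 3): 2, (2, 3): 3}
print(lwb_condition_holds({1, 2, 3}, set(c), tau, c))  # min tau = 3 >= max c = 3
```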

Improvements by Darbha and Agrawal [5] on the LWB algorithm’s optimality criterion allowed for the inclusion of DAGs with a wider range of communication and computational costs, in their task duplication based scheduling (TDS) algorithm. This optimality criterion is as follows: if join task na has k immediate predecessors, and immediate predecessors ni and nj have the highest and second highest ready times (data arrival times) with respect to na, then an optimal schedule is guaranteed with the following earliest start time (est) and earliest completion time (ect):

est(na) = 0 if na = nα; max{ect(ni), ect(nj) + c(nj, na)} otherwise

ect(na) = est(na) + τ(na)

provided that one of the following conditions is met:

if est(ni) ≥ est(nj), then τ(ni) ≥ c(nj, na);
if est(ni) < est(nj), then τ(ni) ≥ c(nj, na) + est(nj) − est(ni).

Park and Choe [9] proposed an improved TDS algorithm, which we will refer to as the ITDS algorithm in this paper. The ITDS algorithm is guaranteed to obtain, for any task na, the earliest start time est(na) and the earliest completion time ect(na). The ITDS approach is as follows: for any task ni, there is a corresponding cluster C(ni) which is obtained by a merging operation on its immediate predecessors’ clusters. Suppose task na has k immediate predecessors n1, n2, …, nk, with corresponding clusters C(n1), C(n2), …, C(nk), sorted in non-increasing order of ready times, where the ready time of ni with respect to na is defined as rtm(ni, na) = ect(ni) + c(ni, na), i.e., rtm(n1, na) ≥ rtm(n2, na) ≥ ⋯ ≥ rtm(nk, na). Then:

C(na) = {na} if na = nα; (⋃_{i=1}^{lmin} C(ni)) ∪ {na} otherwise

est(na) = 0 if na = nα; β(lmin, na) otherwise

ect(na) = est(na) + τ(na)

where lmin is the smallest index between 1 and k such that β(lmin, na) = min_{1≤l≤k} β(l, na), with

β(l, na) = max{ct(⋃_{i=1}^{l} C(ni)), rtm(n_{l+1}, na)}, and rtm(n_{k+1}, na) = 0.

However, est(na) and ect(na) are optimal only if the condition given below is satisfied, that is, if the communication overhead is relatively high:

min_{ne ∈ C(na)} [ect(ne) + max_{nh ∈ isucc(ne) ∩ C(na)} (c(ne, nh) + τ(nh) + Σ_{nj ∈ D(nh) ∩ C(na)} τ(nj))] ≥ ct(C(na))

The optimality check is done as follows.
For every task in C(na), find its earliest completion time. Next, for each immediate successor of this task found in C(na), find the maximum of: the communication cost to, and the computational cost of, the immediate successor, plus the computational cost of each descendant of this immediate successor also found in C(na). Add this maximum to the earliest completion time; if the smallest of these totals is greater than or equal to the completion time of C(na), then C(na) is optimal.
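The β-based selection of lmin described above can be sketched as follows, assuming the prefix-cluster completion times and the sorted ready times are already available; the function and parameter names are illustrative, not from the paper:

```python
def pick_lmin(ct_union, rtm):
    """Select lmin per the ITDS rule.
    ct_union[l-1]: completion time ct(C(n1) ∪ ... ∪ C(nl)) of the first l clusters.
    rtm[l-1]: ready time rtm(nl, na), sorted non-increasing; rtm(n_{k+1}, na) = 0.
    Returns (lmin, beta(lmin, na)), where est(na) = beta(lmin, na)."""
    k = len(rtm)
    # beta(l, na) = max(ct of first l merged clusters, ready time of predecessor l+1)
    beta = [max(ct_union[l - 1], rtm[l] if l < k else 0)
            for l in range(1, k + 1)]
    # lmin is the smallest index attaining the minimum beta
    lmin = min(range(1, k + 1), key=lambda l: beta[l - 1])
    return lmin, beta[lmin - 1]
```

For example, with ct_union = [9, 11, 12] and rtm = [10, 8, 5], beta = [9, 11, 12], so lmin = 1 and est(na) = 9.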

The ITDS algorithm has several drawbacks. Firstly, it does not always produce an optimal solution, since the optimality criterion for a join task is not taken into account during scheduling. Secondly, it does not specify how to obtain an optimal solution if the solution provided is not optimal. Finally, the condition that the communication cost must be relatively high is not clearly defined, and there are occasions when the ITDS algorithm finds an optimal schedule with tasks having a communication cost as low as one (1) unit. This paper presents a task scheduling algorithm which uses duplication to optimally schedule any application represented in the form of a directed acyclic graph (DAG).

Ruan et al. [11] proposed an algorithm which took into account the drawbacks mentioned above, and reported further improvements over the ITDS algorithm. However, Ruan et al.’s algorithm does not always produce an optimal solution. Firstly, their approach assumes that the earliest start time for a join task na is always found in the cluster with the longest ready time, C(n1). The example in Fig. 1 illustrates why their approach does not always produce an optimal solution.

For the DAG given in Fig. 1a, Fig. 1b shows the clusters C(n4) and C(n5), the immediate predecessors of task n6, each having a completion time of 10. Since C(n4) has the longest ready time to task n6, it should be considered as a possible candidate for where n6 should be scheduled. With Ruan et al.’s approach [11], C(n4) is considered as the only candidate. Therefore, to eliminate the communication cost of 10 from C(n5), n5 is merged with C(n4). The algorithm then proceeds to check the immediate predecessor of task n5. It can be seen that rtm(n3, n5) = 14, but the start time of task n5 on C(n5) is 10, so n3 must also be merged with C(n4). In Fig. 1c (i), the algorithm proposed by Ruan et al. [11] finds est(n6) = 18. In contrast, if cluster C(n5) is considered, as shown in Fig. 1c (ii), it can be shown that est(n6) = 15, as follows. First, n4 is merged with C(n5), and then task n2 is considered. Since rtm(n2, n4) = 10 and the start time of task n4 on C(n5) is also 10, there is no need to schedule n2 on C(n5). Secondly, Ruan et al.’s algorithm [11] does not always take advantage of the opportunity to reduce the start time when a task ni and its immediate successor nj are scheduled on different processors. In the above example, if rtm(n3, n5) were 11 instead of 14, the earliest start time of task n5 on C(n6) would have remained 13 with Ruan et al.’s approach. Clearly, est(n5) = 11 on C(n6) if rtm(n3, n5) is considered. It may also be noted that, since the tasks scheduled on a particular processor are not always sorted by task numbers, the complexity of Ruan et al.’s algorithm [11] will degenerate to O(d∣V∣³) in the worst case.

The algorithm proposed in this paper does not require any assumption about the computation and communication costs of the tasks in the DAG, unlike those used in [5], [9]. At the same time, unlike the algorithm of Ruan et al. [11], the proposed algorithm will always produce the shortest schedule. It is shown that the proposed algorithm has a time complexity of O(d∣V∣³), where ∣V∣ represents the number of tasks and d the maximum indegree of tasks. The rest of this paper is organized as follows. The basis of the algorithm is described in Section 2. The proposed algorithm is presented in Section 3. In Section 4 we establish that the algorithm is optimal. Section 5 provides a complexity analysis of the algorithm and a comparison with other algorithms. Section 6 provides an illustrative example of the proposed algorithm. Section 7 presents a comparative study of the proposed algorithm with other works. Finally, Section 8 concludes the paper.

Section snippets

Notation and terminology

The proposed algorithm uses the following notation and terminology:

    Pni

    processor where task ni is the last task to be completed on it

    est(Pni)

    earliest start time of task ni on processor Pni

    est(ni)

    earliest start time of task ni and is the same as est(Pni)

    ct(Pni)

    completion time of task ni on processor Pni

    ect(ni)

    earliest completion time of task ni and is equal to ct(Pni)

    st(nr,Pni)

    start time of task nr on processor Pni

    ct(nr,Pni)

    completion time of task nr on processor Pni

    rtm(ni,na)

    ready time of task ni with respect to task na, i.e., ect(ni) + c(ni, na)

The algorithm

A detailed description of the algorithm is given below.

  • Algorithm Scheduling (V, E, τ, c)

  • // Input: A DAG (V, E, τ, c), with entry task nα

  • // Output: The earliest start time est(na) and the earliest completion time ect(na) for each

  • // task na ∈ V. It also produces, for each processor Pi, 1 ≤ i ≤ n, a list of tasks to be

  • // scheduled on it.

  • Pnα ← {nα};

  • est(nα) ← 0;

  • SetOfReadyTasks ← Topological ordering of nodes in the DAG

  • SetOfReadyTasks ← SetOfReadyTasks − {nα}

  • while (∣SetOfReadyTasks∣ > 0) do

  •    na ← next task in SetOfReadyTasks

  •    Schedule
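The driver loop above can be sketched in runnable form. Note that the per-task scheduling step below is only a placeholder (a simple no-duplication rule over predecessors), not the paper’s schedule_task() procedure:

```python
from graphlib import TopologicalSorter

def scheduling(V, E, tau, c):
    """Sketch of the Scheduling() driver: topological order, entry task first,
    then one scheduling step per remaining task (placeholder step only)."""
    preds = {v: {ni for (ni, nj) in E if nj == v} for v in V}
    order = list(TopologicalSorter(preds).static_order())
    entry = order[0]                       # entry task n_alpha
    processors = {entry: [entry]}          # P_{n_alpha} <- {n_alpha}
    est = {entry: 0}                       # est(n_alpha) <- 0
    for na in order[1:]:                   # SetOfReadyTasks - {n_alpha}
        # Placeholder for schedule_task(na): start after the latest
        # predecessor finishes plus its communication cost (no duplication).
        est[na] = max(est[ni] + tau[ni] + c[(ni, na)] for ni in preds[na])
        processors[na] = [na]
    return est
```

For example, with V = {1, 2, 3}, E = {(1, 2), (1, 3), (2, 3)}, τ = {1: 2, 2: 3, 3: 1} and unit-to-small communication costs, the loop visits tasks in topological order and assigns each a start time.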

Optimality condition

In this section we establish that the algorithm schedule_task() is optimal.

Theorem 1

For any join task na with k immediate predecessor tasks n1, n2, …, nk, where rtm(ni, na) ≥ rtm(n_{i+1}, na) for i = 1, 2, …, k−1, the proposed algorithm constructs Pna correctly, and est(na) can be obtained using the following equation:

est(na) = max_{1≤l≤k} [({⋯{{H1 M̲ Pn2∣na} M̲ Pn3∣na} ⋯ M̲ Pnl∣na}), rtm(n_{l+1}, na)]

when (Hl M̲ P_{n_{l+1}}∣na) ≥ rtm(n_{l+1}, na), where

(Hl M̲ P_{n_{l+1}}∣na) = min[F(Hl, P_{n_{l+1}}, 1, na), F(P_{n_{l+1}}, Hl, 1, na)]

and ct(P_{n_{k+1}}) = rtm(n_{k+1}, na) = 0, H1 = Pn1, (Hk M̲

Complexity analysis

Using a breadth-first search to obtain the ready tasks, the complexity of the ready-task search is O(∣E∣ + ∣V∣). This search is done by algorithm Scheduling(), which also calls procedure schedule_tasks() for the calculation of the earliest start time est(ni) of each task ni. The procedure schedule_tasks() is executed ∣V∣ times. Let d be the maximum indegree of a task; then any join task ni has at most d immediate predecessors. Although d ≤ ∣V∣ − 1, for all practical purposes d ≪ ∣V∣. The procedure schedule_

An illustrative example

For a given DAG, Fig. 6a–d shows how the proposed algorithm schedules each task optimally. est(n1) is 0. The values est(n2), est(n3), est(n4), and est(n5), for the immediate successors of n1, are easily calculated since each has only one immediate predecessor; therefore ct(Pn1), ct(Pn2), ct(Pn3), ct(Pn4), ct(Pn5) are 2, 6, 4, 6 and 3, respectively, as shown in Fig. 6b. Next we consider task n6, which is a join task and has three immediate predecessors. These immediate predecessors, in order of

Performance and comparison

The schedule length, also referred to as the makespan (see, e.g., [2], [11]), is the main performance measure of a scheduling algorithm. However, since a large set of task graphs with different properties is used in our experiments, it becomes necessary to normalize the schedule length with respect to the critical path. This normalized value of the schedule length is referred to as the schedule length ratio (SLR). The SLR value is defined as follows:

SLR = makespan / Σ_{ni ∈ CP_MIN} τ(ni)

The minimum critical
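The SLR computation is straightforward once the minimum critical path CP_MIN is known; a minimal sketch with hypothetical costs (computing CP_MIN itself is a separate longest-path problem and is not shown):

```python
def schedule_length_ratio(makespan, tau, cp_min):
    """SLR = makespan divided by the sum of computational costs along
    the minimum critical path (communication costs excluded)."""
    return makespan / sum(tau[ni] for ni in cp_min)

tau = {1: 2, 2: 3, 3: 4}                          # hypothetical costs
print(schedule_length_ratio(12, tau, [1, 2, 3]))  # 12 / 9
```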

Conclusion

In this paper, a task scheduling algorithm which uses duplication to optimally schedule any application represented in the form of a directed acyclic graph (DAG) was proposed. The proposed algorithm does not require any assumption about the computation or communication costs of the tasks, unlike those in [5], [9]. Moreover, the proposed algorithm always finds the best possible schedule, which may not always be the case with Ruan et al.’s algorithm [11]. The proposed algorithm produces an

Acknowledgements

The authors wish to thank the anonymous referees for their comments and suggestions on an earlier version of the manuscript which have greatly enhanced the readability of the paper. The authors also wish to thank Prof. L. Moseley for his interest and support in this work.


Pranay Chaudhuri received his B.Sc. (Physics) and B.Tech. (Electronics) from Calcutta University. He obtained his M.E. and Ph.D., both in Computer Science & Engineering, from Jadavpur University, Calcutta. At present he is holding the position of Professor of Computer Science and Head of the Department of Computer Science, Mathematics and Physics at the University of the West Indies, Cave Hill Campus, Barbados. Professor Chaudhuri has held faculty positions at the Indian Institute of Technology, James Cook University of North Queensland, University of the New South Wales, and Kuwait University prior to joining the University of the West Indies in 2000. Professor Chaudhuri’s research interests include Parallel and Distributed Algorithms, Self-stabilization, and Graph Theory. He has published extensively in leading international journals and is the author of a book entitled, Parallel Algorithms: Design and Analysis (Prentice-Hall, 1992). Professor Chaudhuri is the recipient of several national and international awards for his research contributions. He is also the recipient of the Vice-Chancellor’s Award for Excellence 2007 at the University of the West Indies for all-round excellent performance in research and service to the University community.

Jeffrey Elcock received his B.Sc. (Mathematics and Computer Science) from the University of the West Indies, Cave Hill Campus, Barbados, and M.Sc. (Computation) from Oxford University. His research interests include Algorithms and Complexity, Distributed Systems and Grid Computing. Currently, Mr. Elcock is pursuing his doctoral research in Computer Science at the University of the West Indies.

A preliminary version of this paper was presented at the 2005 International Conference on Design, Analysis and Simulation of Distributed Systems (DASD 2005), San Diego, California, USA [3].
