Job scheduling in heterogeneous distributed systems

https://doi.org/10.1016/S0164-1212(00)00098-4

Abstract

This paper investigates scheduling policies in a heterogeneous distributed system where half of the processors have double the speed of the others. Processor performance is examined and compared under a variety of workloads. Two job classes are considered: programs of the first class are dedicated to fast processors, while second-class programs are generic in the sense that they can be allocated to any processor. Our intention was to find a policy that increases overall system throughput by increasing the throughput of the generic jobs without seriously degrading the performance of the dedicated jobs. However, simulation results indicate that each scheduling policy considered has its merits and that the best performer depends on the degree of multiprogramming.

Introduction

Distributed systems have become very popular due to a rapid decline in hardware costs. A distributed system is a collection of resources accessed by different users to satisfy their computing needs. Normally, distributed systems are heterogeneous; i.e., processors in the system operate at differing speeds.

In distributed systems, no processor should lie idle while others are overloaded. It is preferable that the workload be uniformly distributed over all of the processors.

Job scheduling is key to the efficient operation of distributed systems. However, most strategies proposed in the literature target only homogeneous processors (Karatza, 1991; Karatza, 1998). Therefore, research is needed to address scheduling in heterogeneous environments. Schedulers for heterogeneous systems have special needs. For example, jobs encounter different execution times on different processors.

Load distribution is necessary to utilize the computational power of a distributed system more efficiently. Various load balancing and load sharing algorithms appear in the literature. In general, the purpose of load balancing is to divide work evenly among the processors, whereas the purpose of load sharing algorithms is to ensure that no processor remains idle while other processors in the system are heavily loaded. With sender-initiated algorithms, load-distribution activity is initiated when an overloaded node (the sender) tries to send a task to an underloaded node (the receiver). In receiver-initiated algorithms, load distribution is initiated by an underloaded node (the receiver), which requests a task from an overloaded node (the sender).
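The distinction can be made concrete with a small sketch. The following Python fragment is not taken from the paper; the queue lengths, the threshold T, and the node-selection rules are illustrative assumptions only.

from typing import List, Optional

# Hypothetical load threshold separating "overloaded" from "underloaded" nodes.
T = 2

def sender_initiated(queues: List[int], sender: int) -> Optional[int]:
    """An overloaded node (sender) looks for an underloaded node to receive a job."""
    if queues[sender] <= T:
        return None                      # sender is not overloaded; no transfer attempted
    candidates = [i for i, q in enumerate(queues) if i != sender and q < T]
    return min(candidates, key=lambda i: queues[i]) if candidates else None

def receiver_initiated(queues: List[int], receiver: int) -> Optional[int]:
    """An underloaded node (receiver) asks an overloaded node to hand over a job."""
    if queues[receiver] >= T:
        return None                      # receiver is not underloaded; no request issued
    candidates = [i for i, q in enumerate(queues) if i != receiver and q > T]
    return max(candidates, key=lambda i: queues[i]) if candidates else None

# Example: node 0 holds 5 queued jobs and node 3 is idle.
loads = [5, 2, 3, 0]
print(sender_initiated(loads, sender=0))      # 3: the overloaded node pushes a job to the idle node
print(receiver_initiated(loads, receiver=3))  # 0: the idle node pulls a job from the busiest node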

Distributed systems require that relevant information such as processor loads be automatically available to the operating system for use in the assignment decisions.

Scheduling policies that use information about the average behavior of the system and ignore its current state are called static policies. Static policies may be either deterministic or probabilistic. Policies that react to the current system state are called adaptive policies.

The principal advantage of static policies is simplicity, since they do not require the maintenance and processing of system state information. Adaptive policies tend to be more complex, mainly because they require information on the system's current state when making transfer decisions. However, the added complexity can yield significant performance gains over those achievable with static policies.

This paper studies the effects of job scheduling on the system and program performance of a distributed system where half of the processors have double the speed of the others. Half of the jobs are dedicated to the fast processors, while the remaining jobs are generic in the sense that they can be individually allocated to any processor.

The model represents a distributed system of loosely coupled workstations communicating over a network. The simulated system represents a real system where a network of workstations is equipped with an I/O server and is extended with a new set of fast workstations and a new fast I/O server. The fast workstations are intended for research purposes, while undergraduate students will use the slow processors to complete their project work. However, undergraduate students may be permitted to use the fast workstations when their project due date is near and the fast processors are not being heavily used for research.

This paper investigates probabilistic, deterministic and adaptive policies. The results are obtained using simulation techniques.

In the probabilistic case, the scheduling policy is described by state independent branching probabilities. Dedicated jobs are dispatched randomly to fast processors with equal probability while the generic jobs are randomly dispatched to slow processors.
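As an illustration, this state-independent rule can be written as a few lines of Python. The indexing convention (fast processors occupying the first P/2 positions) is an assumption made for the sketch, not something fixed by the paper.

import random

def probabilistic_route(job_class: str, P: int) -> int:
    """Return the queue index a job joins under the probabilistic policy."""
    fast = list(range(0, P // 2))   # assumed: fast processors occupy the low indices
    slow = list(range(P // 2, P))
    if job_class == "dedicated":
        return random.choice(fast)  # each fast queue chosen with equal probability
    return random.choice(slow)      # generic jobs are spread over the slow queues

# Example with P = 16: a dedicated job lands on each fast queue with probability 1/8.
print(probabilistic_route("dedicated", 16))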

In the deterministic case, the routing decision is based on the system state. Two different policies are examined for this case. In both, dedicated jobs join the shortest of the fast-processor queues. The first policy requires that generic jobs join the shortest queue of the slow processors, while the second assigns generic jobs to the (slow or fast) processor expected to offer the least job response time. However, when a generic job is assigned to a fast processor, its start time depends on an aging factor.
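A hedged sketch of the two deterministic rules follows. Dedicated jobs always join the shortest fast-processor queue; generic jobs either join the shortest slow queue (first policy) or the queue with the least expected response time (second policy). The response-time estimate used here, (jobs ahead + 1) times the mean service time, is an illustrative assumption, and the aging factor applied when a generic job lands on a fast processor is not modeled.

from typing import List

def shortest_queue(lengths: List[int], indices: List[int]) -> int:
    """Index, among `indices`, of the shortest queue (ties go to the lowest index)."""
    return min(indices, key=lambda i: lengths[i])

def least_expected_response_time(lengths: List[int], mean_service: List[float]) -> int:
    """Processor minimizing the estimated response time of the arriving job."""
    return min(range(len(lengths)), key=lambda i: (lengths[i] + 1) * mean_service[i])

# Example: P = 4, processors 0-1 fast (mean service 1.0), 2-3 slow (mean service 2.0).
lengths = [3, 2, 1, 0]
service = [1.0, 1.0, 2.0, 2.0]
dedicated_target    = shortest_queue(lengths, indices=[0, 1])            # 1
generic_sq_target   = shortest_queue(lengths, indices=[2, 3])            # 3
generic_lert_target = least_expected_response_time(lengths, service)     # 3 in this state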

In the adaptive case, variations of the three scheduling policies described above are proposed. When fast processors become idle and generic jobs are waiting in slow-processor queues, jobs migrate from the heavily loaded slow-processor queues to the idle fast processors. This is a receiver-initiated adaptive load sharing method. It balances the generic job load and can improve overall system performance.
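A minimal sketch of this migration step, assuming an explicit per-processor queue of job labels, is shown below; the queue contents and the fast/slow index split are hypothetical.

from typing import Dict, List, Optional, Tuple

def migrate_generic_job(queues: Dict[int, List[str]],
                        fast: List[int], slow: List[int]) -> Optional[Tuple[int, int]]:
    """Move one waiting generic job to an idle fast processor; return (source, target)."""
    idle_fast = [p for p in fast if not queues[p]]
    if not idle_fast:
        return None
    # Slow queues holding at least one waiting generic job, busiest first.
    loaded = sorted((p for p in slow if "generic" in queues[p]),
                    key=lambda p: len(queues[p]), reverse=True)
    if not loaded:
        return None
    src, dst = loaded[0], idle_fast[0]
    queues[src].remove("generic")        # only queued (non-executing) jobs migrate
    queues[dst].append("generic")
    return src, dst

# Example: fast processor 0 is idle, slow processor 2 has two waiting generic jobs.
q = {0: [], 1: ["dedicated"], 2: ["generic", "generic"], 3: ["generic"]}
print(migrate_generic_job(q, fast=[0, 1], slow=[2, 3]))   # (2, 0)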

Jobs transferred to remote processors incur communication costs. In this model, only queued jobs are transferred. We believe that the average transfer cost for non-executing jobs, although certainly non-negligible, is quite low in comparison with the average job processing cost.

An abundance of research on heterogeneous systems is available in the literature. Chow and Kohler (1979) studied load balancing in heterogeneous multiprocessor systems and introduced an approximate numerical method for analyzing models that use deterministic routing policies. Shenker and Weinrib (1989) studied the optimal control of heterogeneous queuing systems; their heuristic policies are close to optimal. Bonomi and Kumar (1990) studied adaptive optimal load balancing in non-homogeneous multi-server systems. Mirchandaney et al. (1990) analyzed the performance characteristics of simple load balancing heuristics for heterogeneous computing systems. Maheswaran et al. (1999) studied dynamic mapping heuristics for a class of independent tasks on heterogeneous computing systems. Topcuoglu et al. (1999) proposed two heuristics for scheduling directed acyclic weighted task graphs on a bounded number of heterogeneous processors. All of these studies consider open models.

Scheduling policies for distributed systems and multi-programmed parallel systems usually consider only the influence of the scheduling policy on processor performance. They normally do not explicitly model I/O handling, even though it can significantly influence overall system performance. Process scheduling is not an isolated problem; it is only one of many services provided by an operating system. A solution to the scheduling problem must be integrated with solutions to other problems, including I/O management, so that the different parts of the system form a cohesive whole. Rosti et al. (1998) studied large-scale parallel computer systems and suggested that overlapping the I/O demands of some jobs with the computation demands of other jobs offers a potential performance improvement.

A closed queuing network model of a heterogeneous distributed system is studied in Karatza (1994). The distributed system has two processors and one I/O unit. The following two job classes are examined: jobs dedicated to the fast processor and generic jobs that can be executed on either processor. For generic jobs, two scheduling policies are possible. One policy is probabilistic while the other is deterministic, assigning jobs to the processor that offers the least expected response time. No migration of jobs is considered in this system.

In this work, a closed queuing network model is also considered, consisting of P distributed processors and two I/O units. Both 16- and 32-processor cases are simulated in order to examine scalability. The goal is to achieve high system performance and fairness in the execution of both job classes. On one hand, dedicated jobs should not monopolize the fast processors; on the other hand, dedicated jobs should not be overtaken arbitrarily by the generic jobs. The performance of different scheduling policies is compared for various degrees of multiprogramming (various system loads). To our knowledge, no similar job scheduling analysis has appeared in the research literature.

This paper represents an experimental study in the sense that the results are obtained from simulated runs instead of from the measurement of real systems. Nevertheless, the results are of practical value, and all algorithms are practical because they can be implemented. Although absolute performance predictions for specific systems and workloads are not derived, the relative performance of the different job scheduling algorithms across a broad range of workloads is studied. Also, the influence on performance of changes in the workload is analyzed.

For simple idealized systems, performance models can be analyzed mathematically using queuing theory to obtain performance measures. For complex systems, analytical modeling typically requires simplifying assumptions, and such assumptions might have an unforeseeable influence on the results. Therefore, most research efforts have been devoted to approximate analysis, to tractable models for special cases, or to simulation. In our work we choose the latter because it allows the system under study to be simulated in a direct manner, thus lending credibility to the results. Detailed simulation models help determine performance bottlenecks in the architecture and assist in refining the system configuration.

The structure of the paper is as follows. Section 2.1 specifies system and workload models, Section 2.2 describes the task scheduling policies and Section 2.3 presents the metrics employed in assessing the performance of the scheduling policies. Model implementation and input parameters are described in Section 3.1 while the results of the simulation experiments are presented and analyzed in Section 3.2. The final section offers conclusions and provides suggestions for further research.

Section snippets

System and workload models

A closed queuing network model for distributed systems is considered. There are P heterogeneous and independent processors, each served by its own queue. The system is examined for P=16 and P=32, where P is the number of processors. This is a representative model for many existing networks of workstations. The results for the two systems, shown later in Figs. 2-9, indicate that the results for other numbers of processors

Model implementation and input parameters

The queuing network model is simulated with discrete event simulation (Bolch et al., 1998; Law and Kelton, 1991) using the independent replication method. For every mean value, a 95% confidence interval was evaluated. All confidence intervals are less than 5% of the mean values.
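For illustration, the replication check can be expressed as follows; the replication estimates and the Student-t quantile for ten replications are placeholder assumptions, not values taken from the paper.

import statistics

def ci_within_limit(replication_means, t_value: float = 2.262, rel_limit: float = 0.05) -> bool:
    """True if the 95% confidence-interval half-width is below rel_limit of the mean.

    t_value = 2.262 is the Student-t quantile for ten replications (nine degrees
    of freedom) at the 95% level; it must be adjusted for other replication counts.
    """
    n = len(replication_means)
    mean = statistics.mean(replication_means)
    half_width = t_value * statistics.stdev(replication_means) / n ** 0.5
    return half_width <= rel_limit * mean

# Example with ten hypothetical replication estimates of a mean response time.
print(ci_within_limit([4.1, 4.0, 4.3, 3.9, 4.2, 4.0, 4.1, 4.2, 3.8, 4.0]))   # True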

The system under consideration is balanced:

  • P=16 case: m1 = 1.0, m2 = 2.0, k1 = 0.125, k2 = 0.250.

  • P=32 case: m1 = 1.0, m2 = 2.0, k1 = 0.0625, k2 = 0.125.

The degree of multiprogramming N is set at P, 2P, 3P, 4P, 5P, so that in both cases of

Conclusions and further research

This paper studies job scheduling in heterogeneous distributed systems. Two job classes are considered. The objective is to obtain good overall system performance while maintaining the fairness of individual job classes. Simulation is used to generate comparative results.

Six scheduling policies are considered (Pr, PrM, SQ, SQM, LERT–MW, and LERT–MWM) for various degrees of multiprogramming N.

Simulation reveals the following:

  • As far as overall performance is concerned, the SQ and SQM methods

Acknowledgements

My thanks to the anonymous referees who reviewed the first version of this paper and provided me with useful comments and suggestions.

Helen D. Karatza is an Assistant Professor at the Department of Informatics at the Aristotle University of Thessaloniki, Greece. Her research interests mainly include performance evaluation of parallel and distributed systems, multiprocessor scheduling and simulation. Her email and web address are <karatza@csd.auth.gr> and <www.csd.auth.gr/~karatza>.

References (12)

  • R. Mirchandaney et al., Adaptive load sharing in heterogeneous systems, Journal of Parallel and Distributed Computing (1990).

  • G. Bolch et al., Queueing Networks and Markov Chains (1998).

  • F. Bonomi et al., Adaptive optimal load balancing in a nonhomogeneous multiserver system with a central job scheduler, IEEE Transactions on Computers (1990).

  • Y.-C. Chow et al., Models for dynamic load balancing in a heterogeneous multiple processor system, IEEE Transactions on Computers (1979).

  • H.D. Karatza, Simulation study of load balancing and multitasking in a homogeneous distributed system model, Computer System Science and Engineering Journal (1991).

  • H.D. Karatza, Simulation study of load balancing in a heterogeneous distributed system model, International Journal of Modelling and Simulation (1994).