Nonclairvoyantly scheduling power-heterogeneous processors

https://doi.org/10.1016/j.suscom.2011.05.007

Abstract

We show that a natural nonclairvoyant online algorithm for scheduling jobs on a power-heterogeneous multiprocessor is bounded-speed bounded-competitive for the objective of flow plus energy.

Introduction

Many computer architects believe that architectures consisting of heterogeneous processors/cores will be the dominant architectural design in the future [10], [19], [18], [23], [24]. The main advantage of a heterogeneous architecture, relative to an architecture of identical processors, is that it allows for the inclusion of processors whose design is specialized for particular types of jobs, and for jobs to be assigned to a processor best suited for that job. Most notably, it is envisioned that these heterogeneous architectures will consist of a small number of high-power high-performance processors for critical jobs, and a larger number of lower-power lower-performance processors for less critical jobs (see Fig. 1 for a visual representation of such an architecture). Naturally, the lower-power processors would be more energy efficient in terms of the computation performed per unit of energy expended, and would generate less heat per unit of computation. For a given area and power budget, heterogeneous designs can give significantly better performance for standard workloads [10], [23]; evaluations in [18] suggest a figure of 40% better performance, and evaluations in [24] suggest a figure of 70% better performance. Moreover, even processors that were designed to be homogeneous are increasingly likely to be heterogeneous at run time [10]: the dominant underlying cause is the increasing variability in the fabrication process as the feature size is scaled down (although run-time faults will also play a role). Since manufacturing yields would be unacceptably low if every processor/core were required to be perfect, and since there would be significant performance loss from derating the entire chip to the functioning of the least functional processor (which is what would be required in order to attain processor homogeneity), some processor heterogeneity seems inevitable in chips with many processors/cores.

The position paper [10] argues for the fundamental importance of research into scheduling policies for heterogeneous processors, and identifies three fundamental challenges in scheduling heterogeneous multiprocessors: (1) the OS must discover the status of each processor, (2) the OS must discover the resource demand of each job, and (3) given this information about processors and jobs, the OS must match jobs to processors as well as possible. The contribution of this paper is probably best summarized as an initial step in a theoretical worst-case investigation of challenges (2) and (3) in tandem. To explain this contribution, however, it is necessary to first review the results in [16], which in some sense completely solved challenge (3) from a worst-case theoretical perspective.

[16] introduced the following model, building on an earlier model in [6]. There is a collection of m processors, with processor i having a known collection of allowable speeds si,1, …, si,f(i) and associated powers Pi,1, …, Pi,f(i). A set of jobs arrives online over time. Job j arrives in the system at its release time rj. Job j has an associated size pj ∈ ℝ>0, as well as an importance/weight wj ∈ ℝ>0. An online scheduler has two component policies:

  • Job selection: Determines which job to run on each processor at any time.

  • Speed scaling: Determines the speed of each processor at each time.

The objective considered in [16] is that of weighted flow plus energy. A job of size p takes p/s units of time to complete if run at speed s. The flow Fj of a job j is its completion time Cj minus its release time rj. The weighted flow for a job j is wjFj, and the weighted flow for a schedule is ∑j wjFj. The energy of the schedule is the integral of the power consumed over time. The intuitive rationale for the objective of weighted flow plus energy can be understood as follows. Assume that the possibility exists to invest E units of energy to decrease the flow of jobs j1, …, jk by x1, …, xk respectively; then an optimal scheduler (with respect to the objective described above) would make such an investment if and only if ∑i=1..k wi·xi ≥ E. So the importance wj of job j can be viewed as specifying an upper bound on the amount of energy that the system is allowed to invest to reduce j’s flow time by one unit of time (assuming that this energy investment in running j faster does not change the flow time of other jobs); hence jobs with higher weight are more important, since higher investments of energy are permissible to justify a fixed reduction in flow.
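To make the flow-versus-energy trade-off concrete, here is a small illustrative computation for a single job run at a constant speed. The cubic power function P(s) = s^3 and the function name flow_plus_energy are assumptions for this example only; they are not part of the model in [16], which allows arbitrary discrete speed/power settings.

```python
# Illustrative only: a single job of size p run at a constant speed s on one
# processor with an assumed polynomial power function P(s) = s**alpha.
# Flow is p/s; energy is power times running time, P(s) * (p/s).

def flow_plus_energy(p, s, alpha=3.0):
    """Flow plus energy for one job of size p run at constant speed s."""
    flow = p / s
    energy = (s ** alpha) * flow  # power * running time
    return flow + energy

# Brute-force search over a speed grid for the best constant speed.
p, alpha = 1.0, 3.0
best_s = min((s / 1000 for s in range(1, 5000)),
             key=lambda s: flow_plus_energy(p, s, alpha))

# Calculus gives the optimum in closed form: minimizing p*(1 + s**alpha)/s
# yields (alpha - 1) * s**alpha = 1, i.e. s* = (alpha - 1)**(-1/alpha).
closed_form = (alpha - 1) ** (-1.0 / alpha)
```

Running too fast wastes energy, running too slow accumulates flow; the optimum balances the two, which is the intuition the weighted objective generalizes.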

[16] considered the following natural online algorithm, consisting of three policies, which we will call GKP:

  • Job selection: On each processor, always run the highest density job assigned to that processor. The density of a job is its weight divided by its size.

  • Speed scaling: The speed of each processor is set so that the resulting power is the (fractional) weight of the unfinished jobs on that processor. The intuitive reason for this is that this guarantees that the total energy used will be identical to the weighted (fractional) flow time.

  • Assignment: When a new job arrives, it is greedily assigned to the processor that results in the least increase in the projected future (fractional) weighted flow, assuming the adopted speed scaling and job selection policies, and ignoring the possibility of jobs arriving in the future.

Note. In the above, the notions of fractional weight and fractional flow time are relevant only for the objective of weighted flow plus energy. Since our results deal with the case when all weights are identical, one may ignore the fractional aspects of the above procedures.
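As an illustration, the per-processor behavior of GKP at a single instant can be sketched as follows. The data layout, the helper name gkp_step, and the cubic power function are illustrative assumptions; also, GKP properly uses fractional weights, whereas this sketch uses total weights, which (per the note above) coincide in the unweighted case this paper analyzes.

```python
# A minimal sketch (not the authors' implementation) of GKP's job selection
# and speed scaling on one processor. Each job is a dict with 'weight' and
# 'size'; Q maps a power budget to the processor's speed (Q = P^-1).

def gkp_step(unfinished_jobs, Q):
    """Return (job_to_run, speed) for one processor at the current instant."""
    if not unfinished_jobs:
        return None, 0.0
    # Speed scaling: run at the power equal to the total unfinished weight,
    # so total energy used mirrors the total weighted flow.
    power = sum(j['weight'] for j in unfinished_jobs)
    speed = Q(power)
    # Job selection: highest density first (weight divided by size).
    job = max(unfinished_jobs, key=lambda j: j['weight'] / j['size'])
    return job, speed

# Example with Q(y) = y ** (1/3), i.e. an assumed cubic power function.
jobs = [{'weight': 2.0, 'size': 4.0},   # density 0.5
        {'weight': 1.0, 'size': 1.0}]   # density 1.0 -> selected
job, speed = gkp_step(jobs, lambda y: y ** (1.0 / 3.0))
```

The assignment policy, which requires projecting future weighted flow for each candidate processor, is not shown here.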

[16] evaluated this algorithm using resource augmentation analysis [17], which is a type of worst-case comparative analysis, and which we now explain within the context of the type of problems that we consider here. An algorithm A is said to be c-competitive relative to a benchmark algorithm B if for all inputs I it is the case that A(I) ≤ c · B(I), where A(I) is the value of the objective of the schedule output by algorithm A on input I, and B(I) is the value of the objective on the benchmark schedule for input I. In other words, the competitiveness c represents the worst-case error relative to the benchmark. The most obvious choice for the benchmark is probably the optimal schedule for each instance I. But since scheduling on identical processors with the objective of total flow (and even scheduling on a single processor with the objective of weighted flow) is a special case of the problem we consider, O(1)-competitiveness relative to the optimal schedule is not possible [22], [4].

This is a common phenomenon in online scheduling problems, and the standard remedy, called resource augmentation, is to assume that the online algorithm has slightly faster processors than what the optimal schedule can use. In our context, an algorithm A is σ-speed c-competitive if A, equipped with processors with speeds σ · si,1, …, σ · si,f(i), and associated powers Pi,1, …, Pi,f(i) is c-competitive relative to the benchmark of the optimal schedule for processors with speeds si,1, …, si,f(i), and associated powers Pi,1, …, Pi,f(i).

To understand the motivation for resource augmentation analysis, note that it is common for systems to possess the following (informally defined) threshold property: the input or input distributions can be parameterized by a load λ, and the system is parameterized by a capacity μ. The system then has the property that its QoS is very good when the load λ is at most 90% of the system capacity μ, and horrible if λ exceeds 110% of μ. Fig. 2 gives an example of the QoS curve for a system that has this kind of threshold property. Fig. 2 also shows the performance of an online algorithm A which compares reasonably well with the performance of an optimal algorithm. Notice however that the competitive ratio of A relative to the optimal is very large when the load is near capacity μ, since there is a large vertical gap between the two curves at load values slightly below μ. In order to completely explain why the curves for A and optimal in Fig. 2 are “close”, we also need to measure the horizontal gap between the curves. This intuitively measures the ratio of the maximum load for which A has good performance to the (typically larger) maximum load for which the optimal can guarantee good performance. To this end, we would like to say something like: A performs at most c times worse than optimal on inputs with σ times higher load. Notice that multiplying the load by a factor of σ is equivalent to slowing the system down by a factor of σ. This can be captured by a statement which says that A with a σ times faster processor is at most c times as bad as optimal.

The informal notion of an online scheduling algorithm A being “reasonable” is then generally formalized as A having O(1)-competitiveness for some small constant speed augmentation σ. Such a scheduling algorithm (when c is modest) would guarantee a system capacity of at least μ/σ, which is at least a constant fraction of the optimal capacity. The informal notion of an online scheduling algorithm being “good” is then generally formalized as A having O(1)-competitiveness even when σ = 1 + ϵ is arbitrarily close to one. Such an algorithm is called scalable, since it would guarantee a system capacity arbitrarily close to the optimal capacity, while also ensuring that the QoS remains comparable (to within a constant factor). For a more detailed elaboration, see [26], [25].

The main result in [16] was that the online algorithm GKP was scalable for weighted flow plus energy. The analysis in [16] extended theorems showing similar results for weighted flow plus energy on a uniprocessor [6], [3], and for weighted flow on a multiprocessor without power considerations [14].

At a high level, [16] shows that the natural greedy algorithm GKP has the best possible worst-case performance for challenge (3) from [10]. Note, however, that the GKP algorithm is clairvoyant; that is, it needs to know the job sizes when jobs are released. In the GKP algorithm, the job selection policy needs to know the size of a job to compute its density, and the assignment policy must know the size and density to compute the future costs. Thus the GKP algorithm is not directly implementable, as in general one cannot expect the system to know job sizes when jobs are released.

Thus the natural question left open in [16] is to determine the best possible nonclairvoyant scheduling algorithm. Nonclairvoyant algorithms do not require knowledge of job sizes. This can be viewed as addressing challenges (2) and (3) from [10] in tandem (described in Section 1). We note that it is both practically natural, and mathematically necessary, to assume that the system does know the importance of each job.

In this paper, we make a first step toward addressing the open question of finding the best nonclairvoyant scheduling policy. In particular, we consider the simplification that each job has the same importance, or equivalently, we consider the objective of (unweighted) flow time plus energy. Our main result is that a natural nonclairvoyant algorithm is bounded-speed bounded-competitive for the objective of flow plus energy. More precisely, we show that this natural nonclairvoyant scheduling algorithm is (2 + ϵ)-speed O(1/ϵ3)-competitive for the objective of flow plus energy. So intuitively, if this scheduling algorithm is adopted then the system should have capacity at least approximately half of the optimal system capacity. Using the standard interpretation, this natural nonclairvoyant algorithm should therefore be viewed as “reasonable”, or at least as reasonable as Equipartition (equivalently, Round Robin or Processor Sharing) is for the objective of minimizing average flow time (without energy considerations), since Equipartition is known to be (2 + ϵ)-speed O(1/ϵ)-competitive in that context [15].

We now describe the component policies of our nonclairvoyant algorithm:

  • Speed scaling: A collection of processors and associated speed settings are selected so as to maximize the aggregate speed, subject to the constraints that the cardinality of the selected processors is at most the number of unfinished jobs, and the aggregate power is at most the number of unfinished jobs.

  • Job selection: The jobs share this processing power equally.

Intuitively, our speed scaling policy tries to maximize the effective aggregate speed of the assigned processors subject to the same maximum power constraint as in GKP and earlier algorithms. We show that this speed scaling policy can be implemented using a simple and efficient greedy algorithm. After setting the speeds of the machines in the above manner, our job selection policy is to equally share the total speed extracted across all machines. This can be achieved by suitably migrating the jobs on different machines, as long as the number of machines used is at most the number of unfinished jobs.
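The speed scaling step above can be sketched as a power-allocation problem: given n unfinished jobs, distribute a power budget of n over at most n processors so as to maximize the aggregate speed ∑i Qi(yi). Because each Qi is concave, a discretized greedy that always spends the next small chunk of power where it buys the most extra speed is one reasonable way to approximate this; the function name allocate_power and the chunked discretization are assumptions of this sketch, and the paper's exact greedy procedure may differ in its details.

```python
# Hedged sketch: greedily allocate a power budget of n across at most n
# processors to (approximately) maximize the aggregate speed sum_i Q_i(y_i).
# Qs is a list of concave inverse power functions, one per processor.

def allocate_power(Qs, n, chunk=0.01):
    """Return per-processor power levels y (list aligned with Qs)."""
    y = [0.0] * len(Qs)
    budget = float(n)
    while budget >= chunk:
        active = sum(1 for v in y if v > 0)
        best, best_gain = None, 0.0
        for i, Q in enumerate(Qs):
            # Cardinality constraint: a new processor may be switched on
            # only if fewer than n processors are already active.
            if y[i] == 0 and active >= n:
                continue
            gain = Q(y[i] + chunk) - Q(y[i])  # marginal speed per chunk
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            break
        y[best] += chunk
        budget -= chunk
    return y

# Two processors with assumed inverse power functions, and a single
# unfinished job (n = 1), so only one processor may be powered on.
y = allocate_power([lambda p: p, lambda p: p ** (1.0 / 3.0)], n=1)
```

Note that this chunked greedy is myopic: once a processor is switched on under a tight cardinality constraint it stays committed, so it is an approximation rather than an exact maximizer.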

Note that in contrast to the GKP algorithm, this algorithm produces migratory schedules. That is, the same job may be run on different processors over time (however, no job is run on different machines at the same time). However, it is easy to see that job migration is an unavoidable consequence of nonclairvoyance, that is, any bounded-speed bounded-competitive nonclairvoyant algorithm must migrate jobs.

Let us now consider related scheduling problems where there are no power considerations. First let us assume a single processor. The online clairvoyant algorithm Shortest Remaining Processing Time (SRPT) is optimal for unweighted flow. The nonclairvoyant algorithm Shortest Elapsed Time First is scalable for unweighted flow [17]. The nonclairvoyant algorithm Equipartition is (2 + ϵ)-speed O(1/ϵ)-competitive [15] for unweighted flow. The algorithm Highest Density First (HDF) is scalable for weighted flow [8], and there is no online algorithm that has bounded competitiveness against the optimal schedule [4]. Now let us consider multiple identical processors. SRPT is O(log  n)-competitive against the optimal schedule, and no better competitiveness is achievable [22]. The clairvoyant algorithm HDF is scalable for weighted flow [11]. The nonclairvoyant algorithm Weighted LAPS, which is based on the algorithm LAPS in [15], is scalable for weighted flow, and this can be inferred from the problem being a special case of the problem of broadcast scheduling [7].

We now turn our attention to prior work on scheduling involving power management. For the case of a single processor with unbounded speed and a polynomially bounded power function P(s) = sα, [27] gave an efficient offline algorithm to find the schedule that minimizes average flow subject to a constraint on the amount of energy used, in the case that jobs have unit work. However, no such result involving an energy constraint is possible when we transition to online algorithms. Therefore, [1] introduced the objective of flow plus energy and gave a constant competitive algorithm for this objective in the case of unit work jobs. Subsequently, [9] gave a constant competitive algorithm for the objective of weighted flow plus energy. The competitive ratio was improved by [21] for the unweighted case using a potential function specifically tailored to integer flow. [5] extended the results of [9] to the bounded speed model, and [12] gave a nonclairvoyant algorithm that is O(1)-competitive.

Remaining on a single processor, [6] dropped the assumptions of unbounded speed and polynomially bounded power functions, and gave a 3-competitive algorithm for the objective of unweighted flow plus energy, and a 2-competitive algorithm for fractional weighted flow plus energy, when the power function could be arbitrary. The former analysis was subsequently improved to show 2-competitiveness [3].

Moving on to the setting of multiple machines, [20] considers the problem of minimizing flow plus energy on multiple homogeneous processors, where the allowable speeds range between zero and some upper bound, and the power function is polynomial. [20] shows that an algorithm that uses a variation of round robin for the assignment policy, and uses the job selection and speed scaling policies from [9], is scalable for this problem. [13] shows that bounded competitiveness for the objective of flow plus energy is not achievable on multiprocessors if jobs can be run simultaneously on multiple processors and have varying speed-ups (i.e., jobs have different degrees of parallelism). [13] gives a logarithmically competitive algorithm, which is optimal, building on the results in [12].

A schedule specifies, for each time and each processor, the speed of that processor and the job that it runs. We assume that no job may be run on more than one processor simultaneously. The speed is the rate at which work is completed; a job j with size pj run at a constant speed s completes in pj/s seconds. A job is completed when all of its work has been processed. The flow time of a job is the completion time of the job minus the release time of the job. The weighted flow of a job is the weight of the job times the flow time of the job (for our results, all jobs have the same weight, which we may take to be unit).

As noted in [6] we can interpolate the discrete speeds and powers of a processor to a piecewise linear function in the obvious way. See Fig. 3 for an illustration. To elaborate, let s1 and s2 be two allowable speeds for a processor, with associated powers P1 and P2. By time multiplexing the speeds s1 and s2 with proportions λ and 1 − λ respectively (here λ ∈ [0, 1]), one can effectively have a processor that runs at speed λs1 + (1 − λ)s2 with power λP1 + (1 − λ)P2. Note that this is just the linear interpolation of the two points (s1, P1) and (s2, P2). As noted in [6] we may then assume without loss of generality that the power function P has the following properties: P(0) = 0, P is non-decreasing, and P is convex. We will use Pi to denote the resulting power function for processor i, and Qi to denote Pi^−1; i.e., Qi(y) gives the speed at which we can run processor i if we specify a limit of y on the power. Since Pi is convex, Qi is concave, and we exploit this fact in our proofs.
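The interpolation above is mechanical enough to sketch in code. The helper name make_P_and_Q is hypothetical, and the sketch assumes the allowable (speed, power) pairs, once sorted by speed and augmented with the idle point (0, 0), already form a convex curve, as the text's without-loss-of-generality assumption permits.

```python
# Sketch: build the piecewise linear power function P and its inverse Q
# from a processor's allowable (speed, power) settings, modeling the
# time-multiplexing of adjacent speed settings described in the text.
import bisect

def make_P_and_Q(settings):
    """settings: list of (speed, power) pairs for one processor."""
    pts = sorted(set(settings) | {(0.0, 0.0)})  # include the idle state
    speeds = [s for s, _ in pts]
    powers = [p for _, p in pts]

    def interp(xs, ys, x):
        # Linear interpolation of the curve (xs, ys) at x; xs is sorted.
        i = bisect.bisect_right(xs, x)
        if i >= len(xs):
            return ys[-1]  # clamp at the top setting
        lam = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
        return ys[i - 1] + lam * (ys[i] - ys[i - 1])

    P = lambda s: interp(speeds, powers, s)  # speed -> power
    Q = lambda y: interp(powers, speeds, y)  # power -> speed (Q = P^-1)
    return P, Q

# A processor with two settings: speed 1 at power 1, speed 2 at power 8.
P, Q = make_P_and_Q([(1.0, 1.0), (2.0, 8.0)])
```

Because both coordinate lists are sorted the same way, interpolating with the roles of speed and power swapped yields the inverse function Q directly, and concavity of Q follows from convexity of P.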

Finally, let us quickly review the technique of amortized competitiveness analysis on a single processor, which we use in our proofs. Consider an objective G (in our setting, unweighted flow plus energy). Let GA(t) be the increase in the objective in the schedule for algorithm A at time t. So when G is unweighted flow plus energy, GA(t) is Pa(t) + na(t), where Pa(t) is the total power used by A at time t and na(t) is the number of unfinished jobs for A at time t. Let OPT be the optimal benchmark schedule we would like to compare against. The algorithm A is said to be locally c-competitive if for all times t, GA(t) ≤ c · GOPT(t). While such a guarantee would immediately imply our results, it often turns out to be too strong a requirement, even for simplified problems such as minimizing weighted flow time on a single fixed-speed processor. Therefore, we resort to amortized analysis. To prove A is (c + d)-competitive using an amortized local competitiveness argument, it suffices to give a potential function Φ(t) such that the following conditions hold (see for example [25]).

  • Boundary condition: Φ is zero before any job is released and Φ is non-negative after all jobs are finished.

  • Completion condition: Φ does not increase due to completions by either A or OPT.

  • Arrival condition: Φ does not increase more than d · OPT due to job arrivals.

  • Running condition: At any time t when no job arrives or is completed, GA(t) + dΦ(t)/dt ≤ c · GOPT(t).

The sufficiency of these conditions for proving (c + d)-competitiveness follows from integrating them over time.
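The integration argument can be spelled out in a few lines. Using the notation of this section (and writing A and OPT for the total objective values of the two schedules):

```latex
% Integrate the running condition over the maximal intervals containing no
% arrivals or completions. Over such intervals, \int \frac{d\Phi}{dt}\,dt
% accounts for all of \Phi's change except its jumps at arrival and
% completion instants, so
\begin{align*}
A &= \int_0^\infty G_A(t)\,dt
   \;\le\; c \int_0^\infty G_{OPT}(t)\,dt
   \;+\; \Phi(0) - \Phi(\infty)
   \;+\; \sum_{\text{jumps}} \Delta\Phi \\
  &\le\; c \cdot OPT
   \;+\; \underbrace{\Phi(0) - \Phi(\infty)}_{\le\, 0 \text{ (boundary cond.)}}
   \;+\; \underbrace{\sum_{\text{jumps}} \Delta\Phi}_{\le\, d \cdot OPT
         \text{ (arrival and completion cond.)}} \\
  &\le\; (c + d) \cdot OPT .
\end{align*}
```

This is the standard bookkeeping: the boundary condition makes the net drift of Φ nonpositive, and the arrival condition caps its total upward jumps by d · OPT.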

Section snippets

The description of the algorithm

In this section we describe the nonclairvoyant algorithm (denoted by Alg) in greater detail. As mentioned in the introduction (Section 1.2), Alg consists of two components, (i) the speed scaling policy which at any time t determines the power to run each processor at, and (ii) the job selection policy which decides which job is run on which processor.

At any time instant t, let na(t) denote the number of jobs which have been released but remain unfinished for our online algorithm Alg. Also let N

The analysis of Alg

In this section, we show using an amortized local-competitiveness analysis that the algorithm Alg is (2 + ϵ)-speed O(1/ϵ3)-competitive for the objective of flow plus energy. Our main proof in fact shows that the online algorithm Alg is (2 + ϵ)-speed O(1/ϵ2)-competitive relative to the GKP schedule. Indeed, [16] shows that the GKP algorithm is (1 + ϵ)-speed O(1/ϵ)-competitive against any feasible schedule, and in particular, OPT. Therefore, we could combine these two results to get (2 + ϵ)-speed O(1/ϵ3

Conclusion

The main result of this paper is to show that a natural nonclairvoyant algorithm is bounded-speed bounded-competitive for the objective of flow plus energy on power-heterogeneous processors. This paper is a first step towards determining the theoretically best nonclairvoyant algorithm for scheduling jobs of varying importance on power-heterogeneous processors. The obvious two possible next steps are to either find a scalable algorithm for flow plus energy, or find a bounded-speed

Anupam Gupta is an Associate Professor in the Computer Science Department at Carnegie Mellon University. His research interests are in the area of theoretical Computer Science, primarily in developing approximation algorithms for NP-hard optimization problems, and understanding the algorithmic properties of metric spaces. He is the recipient of an Alfred P. Sloan Research Fellowship, and the NSF Career award.

References (27)

  • Carl Bussema et al., Greedy multiprocessor server scheduling, Operations Research Letters (2006)
  • Stefano Leonardi et al., Approximating total flow time on parallel machines, Journal of Computer and Systems Sciences (2007)
  • Susanne Albers et al., Energy-efficient algorithms for flow time minimization, ACM Transactions on Algorithms (2007)
  • Lachlan L.H. Andrew et al., Optimality, fairness, and robustness in speed scaling designs
  • Lachlan L.H. Andrew et al., Optimal speed scaling under arbitrary power functions, SIGMETRICS Performance Evaluation Review (2009)
  • Nikhil Bansal et al., Weighted flow time does not admit O(1)-competitive algorithms
  • Nikhil Bansal et al., Scheduling for speed bounded processors
  • Nikhil Bansal et al., Speed scaling with an arbitrary power function
  • Nikhil Bansal, Ravishankar Krishnaswamy, and Viswanath Nagarajan, Better scalable algorithms for broadcast scheduling, ...
  • Luca Becchetti et al., Nonclairvoyant scheduling to minimize the total flow time on single and parallel machines, Journal of the ACM (2004)
  • Nikhil Bansal et al., Speed scaling for weighted flow time, SIAM Journal on Computing (2009)
  • Fred A. Bower et al., The impact of dynamically heterogeneous multicore processors on thread scheduling, IEEE Micro (2008)
  • Ho-Leung Chan et al., Nonclairvoyant speed scaling for flow and energy


Ravishankar Krishnaswamy is a graduate student in the Computer Science Department, Carnegie Mellon University, advised by Anupam Gupta. His research focuses on the design and analysis of approximation and online algorithms for network design, scheduling, and stochastic optimization problems. Earlier, he received his Bachelor of Technology from IIT Madras, India in 2007.

Kirk Pruhs is a professor of Computer Science at the University of Pittsburgh. His primary research interests are in algorithmic problems related to resource management, scheduling, and sustainable computing.

1 Supported in part by NSF awards CCF-0448095 and CCF-0729022, and an Alfred P. Sloan Fellowship.

2 Supported in part by NSF grants CNS-0325353, IIS-0534531, and CCF-0830558, and an IBM Faculty Award.
