Nonclairvoyantly scheduling power-heterogeneous processors

https://doi.org/10.1016/j.suscom.2011.05.007

Abstract

We show that a natural nonclairvoyant online algorithm for scheduling jobs on a power-heterogeneous multiprocessor is bounded-speed bounded-competitive for the objective of flow plus energy.

Introduction

Many computer architects believe that architectures consisting of heterogeneous processors/cores will be the dominant architectural design in the future [10], [19], [18], [23], [24]. The main advantage of a heterogeneous architecture, relative to an architecture of identical processors, is that it allows for the inclusion of processors whose design is specialized for particular types of jobs, and for jobs to be assigned to a processor best suited for that job. Most notably, it is envisioned that these heterogeneous architectures will consist of a small number of high-power high-performance processors for critical jobs, and a larger number of lower-power lower-performance processors for less critical jobs (see Fig. 1 for a visual representation of such an architecture). Naturally, the lower-power processors would be more energy efficient in terms of the computation performed per unit of energy expended, and would generate less heat per unit of computation. For a given area and power budget, heterogeneous designs can give significantly better performance for standard workloads [10], [23]; evaluations in [18] suggest a figure of 40% better performance, and evaluations in [24] suggest a figure of 70% better performance. Moreover, even processors that were designed to be homogeneous are increasingly likely to be heterogeneous at run time [10]: the dominant underlying cause is the increasing variability in the fabrication process as the feature size is scaled down (although run-time faults will also play a role). Since manufacturing yields would be unacceptably low if every processor/core were required to be perfect, and since there would be significant performance loss from derating the entire chip to the functioning of the least functional processor (which is what would be required in order to attain processor homogeneity), some processor heterogeneity seems inevitable in chips with many processors/cores.

The position paper [10] argues for the fundamental importance of research into scheduling policies for heterogeneous processors, and identifies three fundamental challenges in scheduling heterogeneous multiprocessors: (1) the OS must discover the status of each processor, (2) the OS must discover the resource demand of each job, and (3) given this information about processors and jobs, the OS must match jobs to processors as well as possible. The contribution of this paper is probably best summarized as an initial step in a theoretical worst-case investigation of challenges (2) and (3) in tandem. To explain this contribution, however, it is necessary to first review the results in [16], which in some sense completely solved challenge (3) from a worst-case theoretical perspective.

[16] introduced the following model, building on an earlier model in [6]. There is a collection of m processors, with processor i having a known collection of allowable speeds si,1, …, si,f(i) and associated powers Pi,1, …, Pi,f(i). A set of jobs arrives online over time. Job j arrives in the system at its release time rj. Job j has an associated size pj ∈ ℝ>0, as well as an importance/weight wj ∈ ℝ>0. An online scheduler has two component policies:

  • Job selection: Determines which job to run on each processor at any time.

  • Speed scaling: Determines the speed of each processor at each time.

The objective considered in [16] is that of weighted flow plus energy. A job of size p takes p/s units of time to complete if run at speed s. The flow Fj of a job j is its completion time Cj minus its release time rj. The weighted flow for a job j is wjFj, and the weighted flow for a schedule is ∑j wjFj. The energy of the schedule is the integral of the power consumed over time. The intuitive rationale for the objective of weighted flow plus energy can be understood as follows. Assume that the possibility exists to invest E units of energy to decrease the flow of jobs j1, …, jk by x1, …, xk respectively; then an optimal scheduler (with respect to the objective described above) would make such an investment if and only if ∑i=1..k wi·xi ≥ E. So the importance wj of job j can be viewed as specifying an upper bound on the amount of energy that the system is allowed to invest to reduce j’s flow time by one unit of time (assuming that this energy investment in running j faster does not change the flow time of other jobs); hence jobs with higher weight are more important, since higher investments of energy are permissible to justify a fixed reduction in flow.
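To make the flow-versus-energy trade-off concrete, here is a small illustrative computation for a single job run at a constant speed. The cubic power function P(s) = s^3 and the function name flow_plus_energy are assumptions for this example only; they are not part of the model in [16], which allows arbitrary discrete speed/power settings.

```python
# Illustrative only: a single job of size p run at a constant speed s on one
# processor with an assumed polynomial power function P(s) = s**alpha.
# Flow is p/s; energy is power times running time, P(s) * (p/s).

def flow_plus_energy(p, s, alpha=3.0):
    """Flow plus energy for one job of size p run at constant speed s."""
    flow = p / s
    energy = (s ** alpha) * flow  # power * running time
    return flow + energy

# Brute-force search over a speed grid for the best constant speed.
p, alpha = 1.0, 3.0
best_s = min((s / 1000 for s in range(1, 5000)),
             key=lambda s: flow_plus_energy(p, s, alpha))

# Calculus gives the optimum in closed form: minimizing p*(1 + s**alpha)/s
# yields (alpha - 1) * s**alpha = 1, i.e. s* = (alpha - 1)**(-1/alpha).
closed_form = (alpha - 1) ** (-1.0 / alpha)
```

Running too fast wastes energy, running too slow accumulates flow; the optimum balances the two, which is the intuition the weighted objective generalizes.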

[16] considered the following natural online algorithm, consisting of three policies, which we will call GKP:

  • Job selection: On each processor, always run the highest density job assigned to that processor. The density of a job is its weight divided by its size.

  • Speed scaling: The speed of each processor is set so that the resulting power is the (fractional) weight of the unfinished jobs on that processor. The intuitive reason for this is that this guarantees that the total energy used will be identical to the weighted (fractional) flow time.

  • Assignment: When a new job arrives, it is greedily assigned to the processor that results in the least increase in the projected future (fractional) weighted flow, assuming the adopted speed scaling and job selection policies, and ignoring the possibility of jobs arriving in the future.

Note. In the above, the notions of fractional weight and fractional flow time are relevant only for the objective of weighted flow plus energy. Since our results deal with the case when all weights are identical, one may ignore the fractional aspects of the above procedures.
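As an illustration, the per-processor behavior of GKP at a single instant can be sketched as follows. The data layout, the helper name gkp_step, and the cubic power function are illustrative assumptions; also, GKP properly uses fractional weights, whereas this sketch uses total weights, which (per the note above) coincide in the unweighted case this paper analyzes.

```python
# A minimal sketch (not the authors' implementation) of GKP's job selection
# and speed scaling on one processor. Each job is a dict with 'weight' and
# 'size'; Q maps a power budget to the processor's speed (Q = P^-1).

def gkp_step(unfinished_jobs, Q):
    """Return (job_to_run, speed) for one processor at the current instant."""
    if not unfinished_jobs:
        return None, 0.0
    # Speed scaling: run at the power equal to the total unfinished weight,
    # so total energy used mirrors the total weighted flow.
    power = sum(j['weight'] for j in unfinished_jobs)
    speed = Q(power)
    # Job selection: highest density first (weight divided by size).
    job = max(unfinished_jobs, key=lambda j: j['weight'] / j['size'])
    return job, speed

# Example with Q(y) = y ** (1/3), i.e. an assumed cubic power function.
jobs = [{'weight': 2.0, 'size': 4.0},   # density 0.5
        {'weight': 1.0, 'size': 1.0}]   # density 1.0 -> selected
job, speed = gkp_step(jobs, lambda y: y ** (1.0 / 3.0))
```

The assignment policy, which requires projecting future weighted flow for each candidate processor, is not shown here.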

[16] evaluated this algorithm using resource augmentation analysis [17], which is a type of worst-case comparative analysis, and which we now explain within the context of the type of problems that we consider here. An algorithm A is said to be c-competitive relative to a benchmark algorithm B if for all inputs I it is the case that A(I) ≤ c · B(I), where A(I) is the value of the objective of the schedule output by algorithm A on input I, and B(I) is the value of the objective on the benchmark schedule for input I. In other words, the competitiveness c represents the worst-case error relative to the benchmark. The most obvious choice for the benchmark is probably the optimal schedule for each instance I. But since scheduling on identical processors with the objective of total flow (and even scheduling on a single processor with the objective of weighted flow) is a special case of the problem we consider, O(1)-competitiveness relative to the optimal schedule is not possible [22], [4].

This is a common phenomenon in online scheduling problems, and the standard remedy, called resource augmentation, is to assume that the online algorithm has slightly faster processors than what the optimal schedule can use. In our context, an algorithm A is σ-speed c-competitive if A, equipped with processors with speeds σ · si,1, …, σ · si,f(i), and associated powers Pi,1, …, Pi,f(i) is c-competitive relative to the benchmark of the optimal schedule for processors with speeds si,1, …, si,f(i), and associated powers Pi,1, …, Pi,f(i).

To understand the motivation for resource augmentation analysis, note that it is common for systems to possess the following (informally defined) threshold property: the input or input distributions can be parameterized by a load λ, and the system is parameterized by a capacity μ. The system then has the property that its QoS is very good when the load λ is at most 90% of the system capacity μ, and horrible if λ exceeds 110% of μ. Fig. 2 gives an example of the QoS curve for a system that has this kind of threshold property. Fig. 2 also shows the performance of an online algorithm A which compares reasonably well with the performance of an optimal algorithm. Notice however that the competitive ratio of A relative to the optimal is very large when the load is near capacity μ, since there is a large vertical gap between the two curves at load values slightly below μ. In order to completely explain why the curves for A and optimal in Fig. 2 are “close”, we also need to measure the horizontal gap between the curves. This intuitively measures the ratio of the maximum load for which A has good performance to the (typically larger) maximum load for which the optimal can guarantee good performance. To this end, we would like to say something like: A performs at most c times worse than optimal on inputs with σ times higher load. Notice that multiplying the load by a factor of σ is equivalent to slowing the system down by a factor of σ. This can be captured by a statement which says that A with a σ times faster processor is at most c times as bad as optimal.

The informal notion of an online scheduling algorithm A being “reasonable” is then generally formalized as A having O(1)-competitiveness for some small constant speed augmentation σ. Such a scheduling algorithm (when c is modest) would guarantee a system capacity of at least μ/σ, which is at least a constant fraction of the optimal capacity. The informal notion of an online scheduling algorithm being “good” is then generally formalized as A having O(1)-competitiveness even when σ = 1 + ϵ is arbitrarily close to one. Such an algorithm is called scalable, since it would guarantee a system capacity arbitrarily close to the optimal capacity, while also ensuring that the QoS remains comparable (to within a constant factor). For a more detailed elaboration, see [26], [25].

The main result in [16] was that the online algorithm GKP was scalable for weighted flow plus energy. The analysis in [16] extended theorems showing similar results for weighted flow plus energy on a uniprocessor [6], [3], and for weighted flow on a multiprocessor without power considerations [14].

At a high level, [16] shows that the natural greedy algorithm GKP has the best possible worst-case performance for challenge (3) from [10]. Note, however, that the GKP algorithm is clairvoyant; that is, it needs to know the job sizes when jobs are released. In the GKP algorithm, the job selection policy needs to know the size of a job to compute its density, and the assignment policy must know the size and density to compute the future costs. Thus the GKP algorithm is not directly implementable, as in general one cannot expect the system to know job sizes when jobs are released.

Thus the natural question left open in [16] is to determine the best possible nonclairvoyant scheduling algorithm. Nonclairvoyant algorithms do not require knowledge of job sizes. This can be viewed as addressing challenges (2) and (3) from [10] in tandem (described in Section 1). We note that it is both practically natural, and mathematically necessary, to assume that the system does know the importance of each job.

In this paper, we make a first step toward addressing the open question of finding the best nonclairvoyant scheduling policy. In particular, we consider the simplification that each job has the same importance, or equivalently, we consider the objective of (unweighted) flow time plus energy. Our main result is that a natural nonclairvoyant algorithm is bounded-speed bounded-competitive for the objective of flow plus energy. More precisely, we show that this natural nonclairvoyant scheduling algorithm is (2 + ϵ)-speed O(1/ϵ3)-competitive for the objective of flow plus energy. So intuitively, if this scheduling algorithm is adopted then the system should have capacity at least approximately half of the optimal system capacity. Using the standard interpretation, this natural nonclairvoyant algorithm should therefore be viewed as “reasonable”, or at least as reasonable as Equipartition (equivalently, Round Robin or Processor Sharing) is for the objective of minimizing average flow time (without energy considerations), since Equipartition is known to be (2 + ϵ)-speed O(1/ϵ)-competitive in that context [15].

We now describe the component policies of our nonclairvoyant algorithm:

  • Speed scaling: A collection of processors and associated speed settings are selected so as to maximize the aggregate speed, subject to the constraints that the cardinality of the selected processors is at most the number of unfinished jobs, and the aggregate power is at most the number of unfinished jobs.

  • Job selection: The jobs share this processing power equally.

Intuitively, our speed scaling policy tries to maximize the effective aggregate speed of the assigned processors subject to the same maximum power constraint as in GKP and earlier algorithms. We show that this speed scaling policy can be implemented using a simple and efficient greedy algorithm. After setting the speeds of the machines in the above manner, our job selection policy is to equally share the total speed extracted across all machines. This can be achieved by suitably migrating the jobs on different machines, as long as the number of machines used is at most the number of unfinished jobs.
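The speed scaling step above can be sketched as a power-allocation problem: given n unfinished jobs, distribute a power budget of n over at most n processors so as to maximize the aggregate speed ∑i Qi(yi). Because each Qi is concave, a discretized greedy that always spends the next small chunk of power where it buys the most extra speed is one reasonable way to approximate this; the function name allocate_power and the chunked discretization are assumptions of this sketch, and the paper's exact greedy procedure may differ in its details.

```python
# Hedged sketch: greedily allocate a power budget of n across at most n
# processors to (approximately) maximize the aggregate speed sum_i Q_i(y_i).
# Qs is a list of concave inverse power functions, one per processor.

def allocate_power(Qs, n, chunk=0.01):
    """Return per-processor power levels y (list aligned with Qs)."""
    y = [0.0] * len(Qs)
    budget = float(n)
    while budget >= chunk:
        active = sum(1 for v in y if v > 0)
        best, best_gain = None, 0.0
        for i, Q in enumerate(Qs):
            # Cardinality constraint: a new processor may be switched on
            # only if fewer than n processors are already active.
            if y[i] == 0 and active >= n:
                continue
            gain = Q(y[i] + chunk) - Q(y[i])  # marginal speed per chunk
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            break
        y[best] += chunk
        budget -= chunk
    return y

# Two processors with assumed inverse power functions, and a single
# unfinished job (n = 1), so only one processor may be powered on.
y = allocate_power([lambda p: p, lambda p: p ** (1.0 / 3.0)], n=1)
```

Note that this chunked greedy is myopic: once a processor is switched on under a tight cardinality constraint it stays committed, so it is an approximation rather than an exact maximizer.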

Note that in contrast to the GKP algorithm, this algorithm produces migratory schedules. That is, the same job may be run on different processors over time (however, no job is run on different machines at the same time). However, it is easy to see that job migration is an unavoidable consequence of nonclairvoyance, that is, any bounded-speed bounded-competitive nonclairvoyant algorithm must migrate jobs.

Let us now consider related scheduling problems where there are no power considerations. First let us assume a single processor. The online clairvoyant algorithm Shortest Remaining Processing Time (SRPT) is optimal for unweighted flow. The nonclairvoyant algorithm Shortest Elapsed Time First is scalable for unweighted flow [17]. The nonclairvoyant algorithm Equipartition is (2 + ϵ)-speed O(1/ϵ)-competitive [15] for unweighted flow. The algorithm Highest Density First (HDF) is scalable for weighted flow [8], and there is no online algorithm that has bounded competitiveness against the optimal schedule [4]. Now let us consider multiple identical processors. SRPT is O(log  n)-competitive against the optimal schedule, and no better competitiveness is achievable [22]. The clairvoyant algorithm HDF is scalable for weighted flow [11]. The nonclairvoyant algorithm Weighted LAPS, which is based on the algorithm LAPS in [15], is scalable for weighted flow, and this can be inferred from the problem being a special case of the problem of broadcast scheduling [7].

We now turn our attention to prior work on scheduling involving power management. For the case of a single processor with unbounded speed and a polynomially bounded power function P(s) = sα, [27] gave an efficient offline algorithm to find the schedule that minimizes average flow subject to a constraint on the amount of energy used, in the case that jobs have unit work. However, no such result involving an energy constraint is possible when we transition to online algorithms. Therefore, [1] introduced the objective of flow plus energy and gave a constant competitive algorithm for this objective in the case of unit work jobs. Subsequently, [9] gave a constant competitive algorithm for the objective of weighted flow plus energy. The competitive ratio was improved by [21] for the unweighted case using a potential function specifically tailored to integer flow. [5] extended the results of [9] to the bounded speed model, and [12] gave a nonclairvoyant algorithm that is O(1)-competitive.

Remaining on a single processor, [6] dropped the assumptions of unbounded speed and polynomially bounded power functions, and gave a 3-competitive algorithm for the objective of unweighted flow plus energy, and a 2-competitive algorithm for fractional weighted flow plus energy, when the power function could be arbitrary. The former analysis was subsequently improved to show 2-competitiveness [3].

Moving on to the setting of multiple machines, [20] considers the problem of minimizing flow plus energy on multiple homogeneous processors, where the allowable speeds range between zero and some upper bound, and the power function is polynomial. [20] shows that an algorithm that uses a variation of round robin for the assignment policy, and uses the job selection and speed scaling policies from [9], is scalable for this problem. [13] shows that bounded competitiveness for the objective of flow plus energy is not achievable on multiprocessors if jobs can be run simultaneously on multiple processors and have varying speed-ups (i.e., jobs have different degrees of parallelism). [13] gives a logarithmically competitive algorithm, which is optimal, building on the results in [12].

A schedule specifies, for each time and each processor, the speed of that processor and the job that it runs. We assume that no job may be run on more than one processor simultaneously. The speed is the rate at which work is completed; a job j with size pj run at a constant speed s completes in pj/s seconds. A job is completed when all of its work has been processed. The flow time of a job is the completion time of the job minus the release time of the job. The weighted flow of a job is the weight of the job times the flow time of the job (for our results, all jobs have the same weight, which we may take to be unit).

As noted in [6] we can interpolate the discrete speeds and powers of a processor to a piecewise linear function in the obvious way. See Fig. 3 for an illustration. To elaborate, let s1 and s2 be two allowable speeds for a processor, with associated powers P1 and P2. By time multiplexing the speeds s1 and s2 with proportions λ and 1 − λ respectively (here λ ∈ [0, 1]), one can effectively have a processor that runs at speed λs1 + (1 − λ)s2 with power λP1 + (1 − λ)P2. Note that this is just the linear interpolation of the two points (s1, P1) and (s2, P2). As noted in [6] we may then assume without loss of generality that the power function P has the following properties: P(0) = 0, P is non-decreasing, and P is convex. We will use Pi to denote the resulting power function for processor i, and Qi to denote Pi^−1; i.e., Qi(y) gives the speed at which we can run processor i if we specify a limit of y on the power. Since Pi is convex, Qi is concave, and we exploit this fact in our proofs.
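The interpolation above is mechanical enough to sketch in code. The helper name make_P_and_Q is hypothetical, and the sketch assumes the allowable (speed, power) pairs, once sorted by speed and augmented with the idle point (0, 0), already form a convex curve, as the text's without-loss-of-generality assumption permits.

```python
# Sketch: build the piecewise linear power function P and its inverse Q
# from a processor's allowable (speed, power) settings, modeling the
# time-multiplexing of adjacent speed settings described in the text.
import bisect

def make_P_and_Q(settings):
    """settings: list of (speed, power) pairs for one processor."""
    pts = sorted(set(settings) | {(0.0, 0.0)})  # include the idle state
    speeds = [s for s, _ in pts]
    powers = [p for _, p in pts]

    def interp(xs, ys, x):
        # Linear interpolation of the curve (xs, ys) at x; xs is sorted.
        i = bisect.bisect_right(xs, x)
        if i >= len(xs):
            return ys[-1]  # clamp at the top setting
        lam = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
        return ys[i - 1] + lam * (ys[i] - ys[i - 1])

    P = lambda s: interp(speeds, powers, s)  # speed -> power
    Q = lambda y: interp(powers, speeds, y)  # power -> speed (Q = P^-1)
    return P, Q

# A processor with two settings: speed 1 at power 1, speed 2 at power 8.
P, Q = make_P_and_Q([(1.0, 1.0), (2.0, 8.0)])
```

Because both coordinate lists are sorted the same way, interpolating with the roles of speed and power swapped yields the inverse function Q directly, and concavity of Q follows from convexity of P.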

Finally, let us quickly review the technique of amortized competitiveness analysis on a single processor, which we use in our proofs. Consider an objective G (in our setting, unweighted flow plus energy). Let GA(t) be the increase in the objective in the schedule for algorithm A at time t. So when G is unweighted flow plus energy, GA(t) is Pa(t) + na(t), where Pa(t) is the total power used by A at time t and na(t) is the number of unfinished jobs for A at time t. Let OPT be the optimal benchmark schedule we would like to compare against. The algorithm A is said to be locally c-competitive if for all times t, GA(t) ≤ c · GOPT(t). While such a guarantee would immediately imply our results, it often turns out to be too strong a requirement, even for simplified problems such as minimizing weighted flow time on a single fixed-speed processor. Therefore, we resort to amortized analysis. To prove A is (c + d)-competitive using an amortized local competitiveness argument, it suffices to give a potential function Φ(t) such that the following conditions hold (see for example [25]).

  • Boundary condition: Φ is zero before any job is released and Φ is non-negative after all jobs are finished.

  • Completion condition: Φ does not increase due to completions by either A or OPT.

  • Arrival condition: Φ does not increase more than d · OPT due to job arrivals.

  • Running condition: At any time t when no job arrives or is completed, GA(t) + dΦ(t)/dt ≤ c · GOPT(t).

The sufficiency of these conditions for proving (c + d)-competitiveness follows from integrating them over time.
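The integration argument can be spelled out in a few lines. Using the notation of this section (and writing A and OPT for the total objective values of the two schedules):

```latex
% Integrate the running condition over the maximal intervals containing no
% arrivals or completions. Over such intervals, \int \frac{d\Phi}{dt}\,dt
% accounts for all of \Phi's change except its jumps at arrival and
% completion instants, so
\begin{align*}
A &= \int_0^\infty G_A(t)\,dt
   \;\le\; c \int_0^\infty G_{OPT}(t)\,dt
   \;+\; \Phi(0) - \Phi(\infty)
   \;+\; \sum_{\text{jumps}} \Delta\Phi \\
  &\le\; c \cdot OPT
   \;+\; \underbrace{\Phi(0) - \Phi(\infty)}_{\le\, 0 \text{ (boundary cond.)}}
   \;+\; \underbrace{\sum_{\text{jumps}} \Delta\Phi}_{\le\, d \cdot OPT
         \text{ (arrival and completion cond.)}} \\
  &\le\; (c + d) \cdot OPT .
\end{align*}
```

This is the standard bookkeeping: the boundary condition makes the net drift of Φ nonpositive, and the arrival condition caps its total upward jumps by d · OPT.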

Section snippets

The description of the algorithm

In this section we describe the nonclairvoyant algorithm (denoted by Alg) in greater detail. As mentioned in the introduction (Section 1.2), Alg consists of two components, (i) the speed scaling policy which at any time t determines the power to run each processor at, and (ii) the job selection policy which decides which job is run on which processor.

At any time instant t, let na(t) denote the number of jobs which have been released but remain unfinished for our online algorithm Alg. Also let N

The analysis of Alg

In this section, we show using an amortized local-competitiveness analysis that the algorithm Alg is (2 + ϵ)-speed O(1/ϵ3)-competitive for the objective of flow plus energy. Our main proof in fact shows that the online algorithm Alg is (2 + ϵ)-speed O(1/ϵ2)-competitive relative to the GKP schedule. Indeed, [16] shows that the GKP algorithm is (1 + ϵ)-speed O(1/ϵ)-competitive against any feasible schedule, and in particular, OPT. Therefore, we could combine these two results to get (2 + ϵ)-speed O(1/ϵ3

Conclusion

The main result of this paper is to show that a natural nonclairvoyant algorithm is bounded-speed bounded-competitive for the objective of flow plus energy on power-heterogeneous processors. This paper is a first step towards determining the theoretically best nonclairvoyant algorithm for scheduling jobs of varying importance on power-heterogeneous processors. The obvious two possible next steps are to either find a scalable algorithm for flow plus energy, or find a bounded-speed

Anupam Gupta is an Associate Professor in the Computer Science Department at Carnegie Mellon University. His research interests are in the area of theoretical Computer Science, primarily in developing approximation algorithms for NP-hard optimization problems, and understanding the algorithmic properties of metric spaces. He is the recipient of an Alfred P. Sloan Research Fellowship, and the NSF Career award.

References (27)

  • Carl Bussema et al., Greedy multiprocessor server scheduling, Operations Research Letters (2006)
  • Stefano Leonardi et al., Approximating total flow time on parallel machines, Journal of Computer and Systems Sciences (2007)
  • Susanne Albers et al., Energy-efficient algorithms for flow time minimization, ACM Transactions on Algorithms (2007)
  • Lachlan L.H. Andrew et al., Optimality, fairness, and robustness in speed scaling designs
  • Lachlan L.H. Andrew et al., Optimal speed scaling under arbitrary power functions, SIGMETRICS Performance Evaluation Review (2009)
  • Nikhil Bansal et al., Weighted flow time does not admit O(1)-competitive algorithms
  • Nikhil Bansal et al., Scheduling for speed bounded processors
  • Nikhil Bansal et al., Speed scaling with an arbitrary power function
  • Nikhil Bansal, Ravishankar Krishnaswamy, and Viswanath Nagarajan, Better scalable algorithms for broadcast scheduling, ...
  • Luca Becchetti et al., Nonclairvoyant scheduling to minimize the total flow time on single and parallel machines, Journal of the ACM (2004)
  • Nikhil Bansal et al., Speed scaling for weighted flow time, SIAM Journal on Computing (2009)
  • Fred A. Bower et al., The impact of dynamically heterogeneous multicore processors on thread scheduling, IEEE Micro (2008)
  • Ho-Leung Chan et al., Nonclairvoyant speed scaling for flow and energy


Ravishankar Krishnaswamy is a graduate student in the Computer Science Department, Carnegie Mellon University, advised by Anupam Gupta. His research focuses on the design and analysis of approximation and online algorithms for network design, scheduling, and stochastic optimization problems. Earlier, he received his Bachelor of Technology from IIT Madras, India in 2007.

Kirk Pruhs is a professor of Computer Science at the University of Pittsburgh. His primary research interests are in algorithmic problems related to resource management, scheduling, and sustainable computing.

1 Supported in part by NSF awards CCF-0448095 and CCF-0729022, and an Alfred P. Sloan Fellowship.

2 Supported in part by NSF grants CNS-0325353, IIS-0534531, and CCF-0830558, and an IBM Faculty Award.
