Elsevier

Information Systems

Volume 83, July 2019, Pages 118-125

Fairness in dataflow scheduling in the cloud

https://doi.org/10.1016/j.is.2019.03.003

Highlights

  • Fairness for the scheduling of multiple dataflows on the Cloud where cost is crucial.

  • Heuristic for Pareto-efficient solutions with respect to makespan, cost and fairness.

  • Impact of the prioritization scheme and the pruning method used on the skyline.

Abstract

Expensive dataflow queries, which may involve large-scale computations operating on significant volumes of data, are typically executed on distributed platforms to improve application performance. Among these, cloud computing has emerged as an attractive option for users to execute dataflows, allowing them to select proper configurations (e.g., number of machines) to achieve desired trade-offs between execution time and monetary cost. Discovering dataflow schedules that exhibit the best trade-offs within a plethora of potential solutions can be challenging, especially in a heterogeneous environment where resource characteristics like performance and price can vary. To increase resource utilization, users may also submit multiple dataflows for execution concurrently. Traditionally, building fair schedules (schedules where the slowdown of all dataflows due to resource sharing is similar) while achieving good performance is a major concern. However, considering fairness in the cloud computing setting, where monetary cost is part of the optimization objectives, significantly increases the difficulty of the scheduling problem. This paper proposes an algorithm for the scheduling of multiple dataflows on heterogeneous clouds that identifies Pareto-optimal solutions (schedules) in the three-dimensional space formed from the different trade-offs between overall execution time, monetary cost and fairness. The results show that in most cases the proposed approach can provide solutions with fairer schedules without significantly impacting the quality of the execution time to monetary cost skyline compared to the state of the art, where the fairness of a solution is not taken into account.

Introduction

Big data applications may require the execution of expensive queries with the processing of large volumes of data. Such dataflow queries (or dataflows) can be modeled as a Directed Acyclic Graph (DAG) to describe operators (large-scale computations) and data flow dependencies between them. In a quest for performance optimization, distributed systems have been extensively used for the execution of dataflows allowing users to take advantage of their degree of parallelism and run independent operators (operators without any data dependencies between them) simultaneously. On that account, optimization of application performance has been a major focus of research on dataflow scheduling.

With cloud computing gaining popularity for the execution of complex, scalable applications by offering a flexible environment in which resources are provisioned on demand on a pay-per-use basis, monetary cost has become an equally important optimization objective to consider when selecting the number of resources on which to schedule a dataflow. Optimization objectives often conflict, resulting in a large space of solutions with diverse trade-offs. For example, using a large number of resources will usually lead to better performance at the expense of a higher monetary cost, compared to executing all operators on a single resource. Often, a slightly longer execution time may be tolerated when it comes with significant cost savings. Exploring configurations with good trade-offs between the conflicting objectives is the aim of multi-objective query optimization (MOQO). The complexity of the MOQO problem may further increase when cloud providers offer heterogeneous resources that differ in price and performance; different combinations of resource types, each with a different number of virtual machines (VMs), can be chosen. Although they lead to a significantly larger space of alternative solutions than the homogeneous case, heterogeneous configurations may be preferred in several cases because they exhibit better trade-offs.
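The conflict between execution time and monetary cost described above can be illustrated with a toy model. The sketch below is not taken from the paper: the function name `time_and_cost`, the split into serial and parallel work, and the price per VM-hour are all hypothetical values chosen only to show that adding VMs shortens the run while raising the bill.

```python
def time_and_cost(num_vms, serial_hours=1.0, parallel_hours=8.0,
                  price_per_vm_hour=0.10):
    """Return (execution time in hours, monetary cost) for a toy dataflow.

    Assumption (not from the paper): part of the work is strictly serial,
    the rest splits evenly across VMs, and every VM is billed for the
    whole run at a flat hourly price.
    """
    time = serial_hours + parallel_hours / num_vms
    cost = time * num_vms * price_per_vm_hour
    return time, cost


# Compare a few homogeneous configurations.
configs = {n: time_and_cost(n) for n in (1, 2, 4, 8)}
for n, (t, c) in configs.items():
    print(f"{n} VMs: {t:.2f} h, cost {c:.2f}")
```

Under these assumptions every configuration is a distinct trade-off: none is both faster and cheaper than another, which is the situation MOQO has to navigate.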

In a distributed environment, a user may also submit several independent dataflow queries (DAGs) for execution at the same time so that they share resources. Such DAGs can be interleaved in a single execution schedule to better utilize the resources, exploiting idle slots for the execution of different DAGs. However, as these DAGs compete for the same resources, their performance may be affected compared to the scenario of isolated execution (each single DAG being executed alone), e.g., because several operators are assigned to later slots or a DAG uses fewer resources. When scheduling multiple dataflows, providing fair schedules (schedules where all co-scheduled DAGs experience similar slowdown due to resource sharing) and achieving good overall performance are important optimization objectives [1], [2]. The structure and characteristics of different DAGs may differ significantly. For example, some DAGs may have a larger degree of parallelism, a smaller number of operators or much shorter operator runtimes. Such factors need to be considered when determining the order in which operators from different DAGs are scheduled, as they can significantly impact the fairness achieved in a schedule [3]. Achieving fairness in the Cloud, where monetary cost is an additional concern in the optimization problem, makes decision making for the scheduling of multiple dataflows even more difficult.
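The notion of a fair schedule as one where all co-scheduled DAGs experience similar slowdown can be made concrete with a small sketch. The definitions below are assumptions for illustration, not the paper's exact metric: slowdown is taken as the ratio of a DAG's finish time in the shared schedule to its finish time when run alone, and unfairness as the sum of absolute deviations of the slowdowns from their mean. The function names are hypothetical.

```python
def slowdowns(shared_times, isolated_times):
    """Per-DAG slowdown: time in the shared schedule / time when run alone."""
    return [s / i for s, i in zip(shared_times, isolated_times)]


def unfairness(shared_times, isolated_times):
    """Assumed metric: total absolute deviation of slowdowns from their mean.

    A value of 0 means every DAG is slowed down by the same factor.
    """
    sd = slowdowns(shared_times, isolated_times)
    mean = sum(sd) / len(sd)
    return sum(abs(x - mean) for x in sd)


isolated = [10.0, 20.0]          # each DAG executed alone
schedule_a = [15.0, 30.0]        # both DAGs slowed down by 1.5x
schedule_b = [10.0, 40.0]        # first DAG unaffected, second slowed by 2x

print(unfairness(schedule_a, isolated))
print(unfairness(schedule_b, isolated))
```

With this metric, `schedule_a` is perfectly fair even though both DAGs finish later than in isolation, while `schedule_b` favors the first DAG at the expense of the second.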

In this work, we present “Homogeneous to Heterogeneous Dataflow Scheduling with Fairness” (HHDS-F), a fairness-aware scheduling algorithm for the execution of multiple dataflows on heterogeneous clouds. The algorithm tries to identify Pareto-optimal trade-offs between overall execution time, monetary cost and fairness exploring the solution space in an efficient way. To the best of our knowledge, this is the first work that deals with the scheduling of multiple dataflows to achieve Pareto-efficient solutions in the three-dimensional space formed by the different trade-offs between overall execution time, monetary cost and fairness.

The main contributions of this work are the following:

  • We develop a two-stage heuristic for the scheduling of multiple dataflows on heterogeneous cloud resources to investigate Pareto-efficient solutions with respect to overall execution time, monetary cost and fairness.

  • We present a novel pruning method to select a number of representative solutions distributed along the Pareto curve by favoring points at sharper parts of the curve and leaving fewer points at flatter parts.

  • We present a prioritization scheme to rank operators that come from DAGs with different characteristics and determine the order they are scheduled so that fairness can be promoted.

  • We provide an experimental evaluation and comparison with the state of the art to show the effectiveness of the proposed approach to provide a better and more divergent skyline of solutions using realistic dataflows.

The remainder of the paper is organized as follows. Related work is presented in Section 2. Problem description follows in Section 3. The proposed approach is described in Section 4. The experimental evaluation and its results are discussed in Section 5. Finally, Section 6 concludes the paper.

Related work

DAG scheduling [4] on distributed environments like grids and clouds has been extensively studied, traditionally aiming at optimizing execution time or monetary cost [5], [6], [7], [8], [9], [10], [11]. Several single-objective or constrained multi-objective optimization algorithms have been proposed [5], [8], [9], while others [6], [12] formulate the multi-objective optimization problem as a weighted single-objective problem incorporating user preference parameters into the objective function

Problem description

Varying the assignments of dataflow operators may result in solutions with divergent trade-offs, among which several solutions may be better than others with respect to one or more objectives. A single optimal solution that outperforms all others may not exist, as the objectives may be conflicting. Following the principle of Pareto dominance [23], the set of non-dominated solutions comprises a Pareto front (skyline). This work focuses on the scheduling of multiple dataflows on the Cloud which
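The principle of Pareto dominance and the resulting skyline can be sketched in a few lines. This is a generic, quadratic-time filter for illustration only, not the HHDS-F algorithm; it assumes each solution is a tuple (execution time, monetary cost, unfairness) in which every objective is to be minimized, and the sample values are made up.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def skyline(solutions):
    """Return the non-dominated solutions (the Pareto front)."""
    return [s for s in solutions
            if not any(dominates(other, s) for other in solutions)]


# Hypothetical schedules as (execution time, monetary cost, unfairness).
candidates = [
    (10, 5.0, 0.2),
    (8, 7.0, 0.5),   # dominated by (8, 7.0, 0.1)
    (12, 6.0, 0.3),  # dominated by (10, 5.0, 0.2)
    (8, 7.0, 0.1),
]
front = skyline(candidates)
print(front)
```

The two surviving schedules conflict: one is cheaper and the other is faster and fairer, so neither can be discarded without knowing the user's preferences.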

Algorithm description

In this section a multi-objective scheduling algorithm, Homogeneous to Heterogeneous Dataflow Scheduling with Fairness (HHDS-F), that iteratively builds a skyline of solutions for the execution of a set of dataflows on suitable VMs is presented. The algorithm extends HHDS in [28] to compute Pareto-efficient solutions in the three-dimensional space formed by the different trade-offs between execution time, monetary cost and unfairness for multiple dataflows. HHDS-F uses a new ranking scheme to

Experimental evaluation

In this section, the skyline obtained using the proposed algorithm HHDS-F is evaluated and compared with the state of the art based on simulations for two different dataflow families.

Conclusion

The work in this paper addressed the problem of scheduling multiple dataflows on heterogeneous clouds to identify Pareto-optimal schedules in the solution space formed by the different trade-offs between overall execution time, monetary cost and fairness. The proposed algorithm extends previous work by incorporating a ranking scheme to prioritize between operators that belong to different dataflows and a pruning method to account for fairness in addition to execution time and monetary cost

References (34)

  • Durillo J.J. et al.

    MOHEFT: a multi-objective list-based method for workflow scheduling

  • Prodan R. et al.

    Bi-criteria scheduling of scientific grid workflows

    IEEE T-ASE

    (2010)

  • Li J. et al.

    Cost-conscious scheduling for large graph processing in the cloud

  • Canon L.-C.

    MO-Greedy: an extended beam-search approach for solving a multi-criteria scheduling problem on heterogeneous machines

  • Lin X. et al.

    Selecting stars: the k most representative skyline operators

  • Valkanas G. et al.

    Skyline ranking à la IR.

  • Yu Z. et al.

    A planner-guided scheduling strategy for multiple workflow applications

1 Ilia Pietri is currently working at Intracom S.A. Telecom Solutions, Greece.