DEFT: Dynamic Fault-Tolerant Elastic scheduling for tasks with uncertain runtime in cloud

doi:10.1016/j.ins.2018.10.020

Information Sciences

Volume 477, March 2019, Pages 30-46

https://doi.org/10.1016/j.ins.2018.10.020

Abstract

With the widespread use of clouds, the reliability and efficiency of the cloud have become main concerns of service providers and users. Thus, fault tolerance has become a hot topic in both industry and academia, especially for real-time applications. A great deal of in-depth research has been conducted on achieving fault tolerance in the cloud. Nevertheless, few studies on fault tolerance have taken into account the uncertainty of task runtime, which is the more practical setting and needs urgent attention. In this paper, we introduce uncertainty into our task runtime estimation model and propose a fault-tolerant task allocation mechanism that strategically uses two fault-tolerant task scheduling models while accounting for this uncertainty. Moreover, we employ an overlapping mechanism to improve the resource utilization of the cloud. Based on the two mechanisms, we propose an innovative Dynamic Fault-Tolerant Elastic scheduling algorithm, DEFT, for scheduling real-time tasks in the cloud where the system performance volatility should be considered. The purpose of DEFT is to achieve both fault tolerance and resource utilization efficiency. We compare DEFT with three baseline algorithms: NDRFT, DRFT, and NWDEFT. The results from our extensive experiments on the workload of the Google tracelogs show that DEFT can guarantee fault tolerance while achieving high resource utilization.

Introduction

Cloud computing has become a popular computing paradigm for on-demand provisioning of computing resources to dynamic applications [19]. Running applications on virtual resources, notably virtual machines, is an effective solution for cost-efficiency and scalability [20]. In practice, many applications in fields such as astronomy, financial transactions, and physics face ever-growing data volumes and high computing complexity. For these scientific applications, the cloud can effectively meet the requirements of high-performance computing. However, as the complexity of large-scale systems increases, resource failure has become a major challenge for the cloud [25]. As reported in [6], for a cloud consisting of 10,000 highly reliable servers (MTBF of 30 years), there will be at least one failure per day. Moreover, every year about 5% of disk drives die and servers crash at least twice. What is worse, to keep production costs low, cheap commodity hardware is often used in data centers, which increases the probability of resource failure. Therefore, it is critical to provide fault tolerance in the cloud, especially for real-time applications. For real-time applications, scheduling is most pertinent to fault tolerance.
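The failure figure quoted from [6] can be checked with back-of-the-envelope arithmetic. The sketch below assumes independent servers with a constant failure rate of 1/MTBF, which is a simplification introduced here and not a model taken from the paper; the function name is hypothetical.

```python
def expected_failures_per_day(num_servers: int, mtbf_years: float) -> float:
    """Expected number of server failures per day across the fleet.

    Assumption (ours, for illustration): failures are independent and each
    server fails at a constant rate of 1 / MTBF.
    """
    mtbf_days = mtbf_years * 365.0
    return num_servers / mtbf_days


if __name__ == "__main__":
    rate = expected_failures_per_day(10_000, 30)
    print(f"{rate:.2f} expected failures per day")
```

Under these assumptions the fleet sees on the order of one failure per day, which matches the scale of the figure cited above.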

Fault-tolerant scheduling maps tasks to computing instances and ensures that tasks finish before their deadlines even in the presence of hardware and software failures. So far, two basic scheduling schemes are widely used to handle faults in distributed systems: replication and resubmission [18]. Replication allocates multiple backups of a task to different computing instances. Resubmission re-executes a task on a different computing instance when a fault occurs. However, unlike traditional distributed systems, the cloud has unique features that increase both the flexibility and the complexity of scheduling. In general, there are three main challenges: (1) a host crash results in the failure of multiple computing instances (e.g., VMs); (2) the actual execution time of a task is highly volatile; (3) to minimize resource consumption while ensuring task reliability, the trade-off between resubmission and replication should be considered. The main contributions of this work are summarized as follows:

• We propose an effective runtime estimation approach for tasks running on virtual machines; it expresses the task runtime in the form of a probability, which reflects the volatility of task processing.
• We propose a fault-tolerant mechanism that strategically and dynamically chooses between the traditional resubmission and replication schemes to achieve fault tolerance at a low resource cost.
• We introduce the overlapping mechanism to the proposed fault-tolerant model. On the one hand, it extends the traditional overlapping mechanism, and on the other hand, it can achieve the trade-off between reliability and resource optimization.
• We conduct extensive simulation experiments. Compared with the three baseline algorithms, the reliability, effectiveness, and robustness of the proposed algorithm are verified.
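To make the idea of a runtime "in the form of a probability" concrete, the sketch below models a task's runtime as a normal random variable and asks how likely the task is to finish by its deadline. The normal distribution, the function names, and the parameters are assumptions chosen for illustration; this excerpt does not state which estimation model the paper actually uses.

```python
from statistics import NormalDist


def prob_finish_by(start: float, deadline: float,
                   mean_runtime: float, std_runtime: float) -> float:
    """Probability that a task started at `start` finishes by `deadline`.

    Assumption (ours): runtime ~ Normal(mean_runtime, std_runtime).
    """
    if std_runtime <= 0:
        # Degenerate case: the runtime is known exactly.
        return 1.0 if start + mean_runtime <= deadline else 0.0
    return NormalDist(mean_runtime, std_runtime).cdf(deadline - start)


def runtime_bound(mean_runtime: float, std_runtime: float,
                  confidence: float) -> float:
    """Runtime that the task stays within with the given probability."""
    return NormalDist(mean_runtime, std_runtime).inv_cdf(confidence)


if __name__ == "__main__":
    # Hypothetical task: mean runtime 100 s, standard deviation 10 s.
    print(prob_finish_by(start=0.0, deadline=120.0,
                         mean_runtime=100.0, std_runtime=10.0))
    print(runtime_bound(100.0, 10.0, confidence=0.95))
```

A scheduler could use a bound such as `runtime_bound(..., 0.95)` in place of a single fixed runtime when reserving time on a VM, trading a longer reservation for a lower chance of overrun.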

The rest of this paper is organized as follows. Section 2 briefly reviews the related work. Section 3 presents an overview of our target system and the design problem. In Section 4, we discuss the scheduling strategy and the optimization of resource utilization, based on which, the DEFT scheduling algorithm is developed in Section 5. The experimental evaluation is presented in Section 6. The conclusions and future work are given in Section 7.

Section snippets

Related work

Scheduling is regarded as an important method for taking advantage of cloud computing, so a large number of papers have studied task scheduling in the cloud [12], [15], [23]. In [27], Xiao et al. proposed a cost-aware scheduling method for big data processing. In [31], Zhang et al. proposed a deep learning model for predicting cloud workload, which can improve task scheduling efficiency. Moreover, in [30], they proposed a deep computation scheduling method for the industrial IoT

System framework

Fig. 1 shows the overview of our target system. The data center consists of a set of hosts and each host can provide a number of computing instances (namely, virtual machines). The task flows sent by the users are queued and waiting to be dispatched to the computing instances of the data center. The system scheduler consists of a task scheduler and a performance monitor. The monitor provides the system performance status. The task scheduler schedules tasks according to the feedback from the

Task scheduling and resource allocation

The key issue of fault tolerance is to assign tasks to the appropriate virtual machines while ensuring that resource allocations meet fault-tolerant constraints. Since the failure of one host causes all VMs on it to fail, the primary and the backup of a task should be allocated to VMs on different hosts when the replication method is adopted. For the resubmission method, if the initial task is aborted due to the host failure, the re-submitted one should be scheduled
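The host-separation constraint for replication described above can be sketched as a simple placement filter. The `VM` type, its fields, and the first-fit choice are hypothetical and introduced only for illustration; the paper's actual allocation heuristics are not shown in this excerpt.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class VM:
    vm_id: str
    host_id: str  # physical host that runs this VM


def pick_backup_vm(primary: VM, candidates: Sequence[VM]) -> Optional[VM]:
    """Return the first candidate VM that is not on the primary's host.

    A host crash takes down every VM on that host, so a backup placed on
    the same host as the primary would fail together with it.
    """
    for vm in candidates:
        if vm.host_id != primary.host_id:
            return vm
    return None  # no host-disjoint VM is available


if __name__ == "__main__":
    primary = VM("vm-1", "host-A")
    pool = [VM("vm-2", "host-A"), VM("vm-3", "host-B")]
    print(pick_backup_vm(primary, pool))
```

When the function returns `None`, a scheduler would have to provision a VM on another host or reject the task; which of the two happens is a design decision outside this sketch.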

Dynamic Fault-Tolerant Elastic scheduling algorithm – DEFT

In this section, based on the fault-tolerant mechanisms discussed above, we design an innovative Dynamic Fault-Tolerant Elastic scheduling algorithm, DEFT, for real-time task scheduling that takes into account the volatility of system performance. DEFT uses heuristic approaches to optimize resource utilization as long as fault tolerance is guaranteed. It consists of three major parts: scheduling strategy selection, resource allocation for primaries, and resource allocation for backups. For
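As a rough illustration of what a strategy-selection step might look like, the sketch below picks resubmission when the deadline leaves room for a second full execution and falls back to replication otherwise. This slack-based rule, the function name, and the parameters are assumptions made here for illustration; the excerpt does not give DEFT's actual selection criterion.

```python
def choose_strategy(arrival: float, deadline: float,
                    runtime_upper_bound: float) -> str:
    """Choose a fault-tolerance scheme for one task (illustrative rule only).

    Assumption (ours): if the window between arrival and deadline can hold
    two back-to-back executions, a failed run can be re-executed in time,
    so the cheaper resubmission scheme suffices. Otherwise a backup copy
    is scheduled up front (replication).
    """
    slack = deadline - arrival
    if slack >= 2 * runtime_upper_bound:
        return "resubmission"
    return "replication"


if __name__ == "__main__":
    print(choose_strategy(arrival=0.0, deadline=100.0, runtime_upper_bound=40.0))
    print(choose_strategy(arrival=0.0, deadline=100.0, runtime_upper_bound=60.0))
```

The `runtime_upper_bound` argument is meant to be a conservative runtime estimate, so that runtime uncertainty is reflected in the choice between the two schemes.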

Performance evaluation

To verify the performance of DEFT, we conduct experiments on CloudSim using the Google tracelogs. We quantitatively compare DEFT with three baseline algorithms: Non Dynamic Fault Tolerance by Replication (NDRFT), Dynamic Fault Tolerance by Replication (DRFT), and Non Weak Fault Tolerance of DEFT (NWDEFT). The three baseline algorithms are briefly described as follows:

• NDRFT is derived from the replication method for fault tolerance, which has been widely accepted in academia.

Conclusions and future work

In this paper, we focus on the fault tolerance when the uncertainty of task runtime is considered. We propose an efficient fault-tolerant scheduling algorithm DEFT, which extends the traditional fault tolerance model. Meanwhile, we apply the overlapping mechanism to the fault tolerance model, which can effectively achieve the tradeoff between reliability and resource optimization in cloud. Through modeling the task execution time with uncertainty and comprehensive employment of replication and

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grants 61872378, 61572511, and 91648204, in part by Science Fund for Distinguished Young Scholars in Hunan Province under grant 2018JJ1032, in part by the China Postdoctoral Science Foundation under Grant 2016M602960 and 2017T100796.

References (35)

R. Al-Omari et al., Efficient overloading techniques for primary-backup scheduling in real-time systems, J. Parallel Distrib. Comput. (2004)
A. Beloglazov et al., Energy-aware resource allocation heuristics for efficient management of data centers for cloud computing, Future Gener. Comput. Syst. (2012)
H. Chen et al., Towards energy-efficient scheduling for real-time tasks under uncertain cloud computing environment, J. Syst. Softw. (2015)
S. Fu, Failure-aware resource management for high-availability computing clusters with distributed virtual machines, J. Parallel Distrib. Comput. (2010)
S. Ghosh et al., Fault-tolerance through scheduling of aperiodic tasks in hard real-time multiprocessor systems, IEEE Trans. Parallel Distrib. Syst. (1997)
J. Rao et al., Qos guarantees and service differentiation for dynamic cloud applications, IEEE Trans. Netw. Serv. Manag. (2013)
M.Y. Tsai et al., Heuristic scheduling strategies for linear-dependent and independent jobs on heterogeneous grids, International Conference on Grid and Distributed Computing, Gdc 2011, Held As (2011)
L. Wu et al., SLA-Based admission control for a software-as-a-service provider in cloud computing environments, J. Comput. Syst. Sci. (2012)
W. Xiao et al., Cost-aware big data processing across geo-distributed datacenters, IEEE Trans. Parallel Distrib. Syst. (2017)
Y. Xu et al., A hybrid chemical reaction optimization scheme for task scheduling on heterogeneous computing systems, IEEE Trans. Parallel Distrib. Syst. (2015)
S. Antony et al., Task scheduling algorithm with fault tolerance for cloud, International Conference on Computing Sciences (2012)
H. Chen et al., Uncertainty-aware online scheduling for real-time workflows in cloud service environment, IEEE Trans. Serv. Comput. (2018)
J.S. Dean, Designs, lessons and advice from building large distributed systems, Ladis (2009)
Y. Ding et al., Fault-tolerant elastic scheduling algorithm for workflow in cloud systems, Inf. Sci. (2017)
R. Jeyarani et al., Design and implementation of an efficient two-level scheduler for cloud computing environment, IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (2010)
M. Jia et al., Heuristic offloading of concurrent tasks for computation-intensive applications in mobile cloud computing, Computer Communications Workshops (2014)
M. Lin et al., Hybrid genetic algorithms for scheduling partially ordered tasks in a multi-processor environment, International Conference on Real-Time Computing Systems and Applications (1999)
