Elsevier

Information Sciences

Volume 477, March 2019, Pages 30-46
DEFT: Dynamic Fault-Tolerant Elastic scheduling for tasks with uncertain runtime in cloud

https://doi.org/10.1016/j.ins.2018.10.020

Abstract

With the widespread use of clouds, reliability and efficiency have become the main concerns of service providers and users. Fault tolerance has therefore become a hot topic in both industry and academia, especially for real-time applications. A great deal of in-depth research has been conducted on fault tolerance in the cloud. Nevertheless, few studies have taken into account the uncertainty of task runtime, which is more realistic and needs urgent attention. In this paper, we introduce uncertainty into our task runtime estimation model and propose a fault-tolerant task allocation mechanism that strategically uses two fault-tolerant task scheduling models while accounting for this uncertainty. Moreover, we employ an overlapping mechanism to improve the resource utilization of the cloud. Based on these two mechanisms, we propose an innovative Dynamic Fault-Tolerant Elastic scheduling algorithm, DEFT, for scheduling real-time tasks in the cloud where the volatility of system performance must be considered. The purpose of DEFT is to achieve both fault tolerance and efficient resource utilization. We compare DEFT with three baseline algorithms: NDRFT, DRFT, and NWDEFT. The results of our extensive experiments on workloads from the Google tracelogs show that DEFT can guarantee fault tolerance while achieving high resource utilization.

Introduction

Cloud computing has become a popular computing paradigm for on-demand provisioning of computing resources to dynamic applications [19]. Running applications on virtual resources, notably virtual machines, is an effective solution for cost-efficiency and scalability [20]. In practice, many applications in fields such as astronomy, financial transactions, and physics face ever-growing data volumes and high computational complexity. For these scientific applications, the cloud can effectively meet the requirements of high-performance computing. However, as the complexity of large-scale systems increases, resource failure has become a major challenge for the cloud [25]. As reported in [6], a cloud consisting of 10,000 highly reliable servers (MTBF of 30 years) will still see about one failure per day. Moreover, every year about 5% of disk drives die and servers crash at least twice. Worse still, to keep production costs low, data centers often use cheap commodity hardware, which increases the probability of resource failure. Therefore, it is critical to provide fault tolerance in the cloud, especially for real-time applications. For real-time applications, scheduling is most pertinent to fault tolerance.
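
The server failure figure quoted from [6] can be checked with back-of-the-envelope arithmetic. The sketch below assumes that the N = 10,000 servers fail independently and that each fails on average once per MTBF; both assumptions are made here for illustration and are not stated in the source.

```latex
\[
\text{expected failures per day}
  = \frac{N}{\text{MTBF in days}}
  = \frac{10\,000}{30 \times 365}
  \approx 0.91
\]
```

Under these assumptions, even very reliable servers yield on the order of one failure per day at this scale.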

Fault-tolerant scheduling maps tasks to computing instances and ensures that tasks finish before their deadlines even in the presence of hardware and software failures. So far, two basic scheduling schemes are widely used to handle faults in distributed systems: replication and resubmission [18]. Replication allocates multiple backups of a task to different computing instances. Resubmission re-executes a task on a different computing instance when a fault occurs. However, unlike traditional distributed systems, the cloud has unique features that increase both the flexibility and the complexity of scheduling. In general, there are three main challenges: (1) a host crash results in the failure of multiple computing instances (e.g., VMs); (2) the actual runtime of a task is highly volatile; (3) to minimize resource consumption while ensuring task reliability, the trade-off between resubmission and replication must be considered. The main contributions of this work are summarized as follows:

  • We propose an effective runtime estimation approach for tasks running on virtual machines; it expresses the task runtime as a probability, which reflects the volatility of task processing.

  • We propose a fault-tolerant mechanism that strategically and dynamically chooses between the traditional resubmission and replication schemes to achieve fault tolerance at a low resource cost.

  • We introduce an overlapping mechanism into the proposed fault-tolerant model. It extends the traditional overlapping mechanism and achieves a trade-off between reliability and resource optimization.

  • We conduct extensive simulation experiments. Comparison with three baseline algorithms verifies the reliability, effectiveness, and robustness of the proposed algorithm.
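
As a rough illustration of the first two contributions, the sketch below models a task's runtime as a probability distribution and chooses between resubmission and replication from the slack left before the deadline. The normal runtime model, the 0.95 confidence level, the names `RuntimeEstimate` and `choose_strategy`, and the decision rule itself are assumptions made for illustration. They are not the paper's estimation model or selection rule, which this excerpt does not reproduce.

```python
from dataclasses import dataclass
from statistics import NormalDist


@dataclass(frozen=True)
class RuntimeEstimate:
    """Hypothetical runtime model: runtime ~ Normal(mean, std), in seconds."""
    mean: float
    std: float

    def quantile(self, confidence: float) -> float:
        """Runtime that the task stays under with the given probability."""
        if self.std == 0:
            return self.mean  # deterministic runtime, no spread to account for
        return NormalDist(self.mean, self.std).inv_cdf(confidence)


def choose_strategy(estimate: RuntimeEstimate, now: float, deadline: float,
                    confidence: float = 0.95) -> str:
    """Pick a fault-tolerance scheme from the slack before the deadline.

    Assumed rule (not the paper's): if two back-to-back executions fit
    before the deadline, a failed run can simply be resubmitted; if only
    one fits, a backup copy has to run in parallel; otherwise the task
    cannot meet its deadline and is rejected.
    """
    budget = estimate.quantile(confidence)
    slack = deadline - now
    if slack >= 2 * budget:
        return "resubmission"
    if slack >= budget:
        return "replication"
    return "reject"


if __name__ == "__main__":
    est = RuntimeEstimate(mean=100.0, std=10.0)
    for deadline in (300.0, 150.0, 100.0):
        print(deadline, choose_strategy(est, now=0.0, deadline=deadline))
```

The rule reflects the trade-off named above: resubmission uses extra resources only when a fault actually occurs but needs enough slack for a second run, whereas replication tolerates a tight deadline at the price of reserving resources for a backup up front.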

The rest of this paper is organized as follows. Section 2 briefly reviews the related work. Section 3 presents an overview of our target system and the design problem. In Section 4, we discuss the scheduling strategy and the optimization of resource utilization, based on which, the DEFT scheduling algorithm is developed in Section 5. The experimental evaluation is presented in Section 6. The conclusions and future work are given in Section 7.

Related work

Scheduling is regarded as an important means of exploiting cloud computing, so a large number of papers have studied task scheduling in the cloud [12], [15], [23]. In [27], Xiao et al. proposed a cost-aware scheduling method for big data processing. In [31], Zhang et al. proposed a deep learning model for predicting cloud workload, which can improve task scheduling efficiency. Moreover, in [30], they proposed a deep computation scheduling method for the industrial IoT

System framework

Fig. 1 shows an overview of our target system. The data center consists of a set of hosts, and each host provides a number of computing instances (namely, virtual machines). The task flows sent by users are queued and wait to be dispatched to the computing instances of the data center. The system scheduler consists of a task scheduler and a performance monitor. The monitor provides the system performance status. The task scheduler schedules tasks according to the feedback from the

Task scheduling and resource allocation

The key issue of fault tolerance is to assign tasks to appropriate virtual machines while ensuring that resource allocations meet the fault-tolerance constraints. Since the failure of one host causes all VMs on it to fail, the primary and the backup of a task should be allocated to VMs on different hosts when the replication method is adopted. For the resubmission method, if the initial task is aborted due to a host failure, the re-submitted one should be scheduled
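
The host-separation constraint for replication described above can be sketched as follows. The function name `place_primary_and_backup`, the `vm_to_host` mapping, and the assumption that candidate VMs arrive already ordered by preference are hypothetical and chosen for illustration; the paper's actual allocation procedure is not shown in this excerpt.

```python
from typing import Dict, List, Optional, Tuple


def place_primary_and_backup(
    vm_to_host: Dict[str, str],
    candidates: List[str],
) -> Optional[Tuple[str, str]]:
    """Return (primary_vm, backup_vm) located on two different hosts.

    `candidates` is assumed to be ordered by preference, for example
    earliest available first. The first VM becomes the primary, and the
    first later VM on another host becomes the backup. Returns None when
    every candidate sits on the same host, because a single host crash
    would then take down both copies.
    """
    if not candidates:
        return None
    primary = candidates[0]
    for backup in candidates[1:]:
        if vm_to_host[backup] != vm_to_host[primary]:
            return primary, backup
    return None


if __name__ == "__main__":
    hosts = {"vm1": "hostA", "vm2": "hostA", "vm3": "hostB"}
    print(place_primary_and_backup(hosts, ["vm1", "vm2", "vm3"]))
    print(place_primary_and_backup(hosts, ["vm1", "vm2"]))
```

In the first call `vm2` is skipped because it shares `hostA` with the primary; in the second call no valid backup exists, so the caller would have to provision a VM on another host or fall back to a different scheme.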

Dynamic Fault-Tolerant Elastic scheduling algorithm – DEFT

In this section, based on the fault-tolerant mechanisms discussed above, we design an innovative Dynamic Fault-Tolerant Elastic scheduling algorithm, DEFT, for real-time task scheduling that takes into account the volatility of system performance. DEFT uses heuristic approaches to optimize resource utilization as long as fault tolerance is guaranteed. It consists of three major parts: scheduling strategy selection, resource allocation for primaries, and resource allocation for backups. For

Performance evaluation

To verify the performance of DEFT, we conduct experiments on CloudSim using the Google tracelogs. We quantitatively compare DEFT with three baseline algorithms: Non Dynamic Fault Tolerance by Replication (NDRFT), Dynamic Fault Tolerance by Replication (DRFT), and Non Weak Fault Tolerance of DEFT (NWDEFT). The three baseline algorithms are briefly described as follows:

  • NDRFT is derived from the replication method for fault tolerance, which has been widely accepted in academia.

Conclusions and future work

In this paper, we focus on fault tolerance when the uncertainty of task runtime is considered. We propose an efficient fault-tolerant scheduling algorithm, DEFT, which extends the traditional fault tolerance model. Meanwhile, we apply the overlapping mechanism to the fault tolerance model, which can effectively achieve the trade-off between reliability and resource optimization in the cloud. Through modeling the task execution time with uncertainty and comprehensive employment of replication and

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grants 61872378, 61572511, and 91648204, in part by the Science Fund for Distinguished Young Scholars in Hunan Province under Grant 2018JJ1032, and in part by the China Postdoctoral Science Foundation under Grants 2016M602960 and 2017T100796.

References (35)

  • S. Antony et al. Task scheduling algorithm with fault tolerance for cloud. International Conference on Computing Sciences (2012)

  • H. Chen et al. Uncertainty-aware online scheduling for real-time workflows in cloud service environment. IEEE Trans. Serv. Comput. (2018)

  • J.S. Dean. Designs, lessons and advice from building large distributed systems. LADIS (2009)

  • Y. Ding et al. Fault-tolerant elastic scheduling algorithm for workflow in cloud systems. Inf. Sci. (2017)

  • R. Jeyarani et al. Design and implementation of an efficient two-level scheduler for cloud computing environment. IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (2010)

  • M. Jia et al. Heuristic offloading of concurrent tasks for computation-intensive applications in mobile cloud computing. Computer Communications Workshops (2014)

  • M. Lin et al. Hybrid genetic algorithms for scheduling partially ordered tasks in a multi-processor environment. International Conference on Real-Time Computing Systems and Applications (1999)