DEFT: Dynamic Fault-Tolerant Elastic scheduling for tasks with uncertain runtime in cloud
Introduction
Cloud computing has become a popular paradigm for on-demand provisioning of computing resources to dynamic applications [19]. Running applications on virtual resources, notably virtual machines, is an effective way to achieve cost efficiency and scalability [20]. In practice, applications in many fields, such as astronomy, financial transactions, and physics, face ever-growing data volumes and high computational complexity. For these scientific applications, the cloud can effectively meet the requirements of high-performance computing. However, as the complexity of large-scale systems increases, resource failure has become a major challenge for the cloud [25]. As reported in [6], a cloud consisting of 10,000 highly reliable servers (MTBF of 30 years) will experience at least one failure per day. Moreover, every year about 5% of disk drives die and servers crash at least twice. Worse still, to keep production costs low, data centers often use cheap commodity hardware, which increases the probability of resource failure. Therefore, it is critical to provide fault tolerance in the cloud, especially for real-time applications, for which scheduling is the mechanism most pertinent to fault tolerance.
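The failure figure quoted from [6] can be sanity-checked with a back-of-envelope calculation. The sketch below is illustrative only: it assumes independent servers with a constant failure rate of 1/MTBF, and the function name `expected_failures_per_day` is a hypothetical helper, not something defined in the paper.

```python
def expected_failures_per_day(num_servers: int, mtbf_years: float) -> float:
    """Expected number of failures per day across a fleet.

    Assumption (not from the paper): servers fail independently, each at a
    constant rate of 1/MTBF.
    """
    mtbf_days = mtbf_years * 365.25  # convert the MTBF from years to days
    return num_servers / mtbf_days


if __name__ == "__main__":
    rate = expected_failures_per_day(10_000, 30)
    print(f"{rate:.2f} expected failures per day")
```

Under these assumptions the fleet-wide rate comes out at roughly one failure per day, the same order of magnitude as the figure cited above.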
Fault-tolerant scheduling maps tasks to computing instances and ensures that tasks finish before their deadlines even in the presence of hardware and software failures. So far, two basic schemes are widely used to handle faults in distributed systems: replication and resubmission [18]. Replication allocates multiple backups of a task to different computing instances. Resubmission re-executes a task on a different computing instance when a fault occurs. However, unlike traditional distributed systems, the cloud has unique features that increase both the flexibility and the complexity of scheduling. In general, there are three main challenges: (1) a host crash causes the failure of multiple computing instances (e.g., VMs); (2) the actual execution time of a task is highly volatile; (3) to minimize resource consumption while ensuring task reliability, the trade-off between resubmission and replication must be considered. The main contributions of this work are summarized as follows:
- We propose an effective runtime estimation approach for tasks running on virtual machines; it expresses the task runtime as a probability distribution, which reflects the volatility of task processing.
- We propose a fault-tolerant mechanism that strategically and dynamically chooses between the traditional resubmission and replication schemes to achieve fault tolerance at a low resource cost.
- We introduce the overlapping mechanism into the proposed fault-tolerant model. This extends the traditional overlapping mechanism and also achieves a trade-off between reliability and resource optimization.
- We conduct extensive simulation experiments. Comparison with three baseline algorithms verifies the reliability, effectiveness, and robustness of the proposed algorithm.
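To make the first two contributions more concrete, the following sketch shows one way a scheduler could pick between resubmission and replication from a probabilistic runtime estimate. It is not the paper's algorithm: the normal-distribution runtime model, the 95% confidence level, the "twice the runtime bound" rule, and the names `runtime_bound` and `choose_strategy` are all assumptions made for illustration.

```python
from statistics import NormalDist


def runtime_bound(mean: float, stddev: float, confidence: float = 0.95) -> float:
    """Runtime that the task stays under with probability `confidence`.

    Assumption: the runtime is normally distributed with the given mean and
    standard deviation.
    """
    return NormalDist(mean, stddev).inv_cdf(confidence)


def choose_strategy(now: float, deadline: float, mean: float, stddev: float,
                    confidence: float = 0.95) -> str:
    """Pick a fault-tolerance scheme from the slack before the deadline."""
    bound = runtime_bound(mean, stddev, confidence)
    slack = deadline - now
    if slack < bound:
        # Even a single fault-free execution is unlikely to meet the deadline.
        return "reject"
    if slack >= 2 * bound:
        # Enough time to run once, fail, and run again: no backup copy needed.
        return "resubmission"
    # A rerun would miss the deadline, so a backup must be placed up front.
    return "replication"


if __name__ == "__main__":
    print(choose_strategy(now=0, deadline=100, mean=30, stddev=5))
    print(choose_strategy(now=0, deadline=50, mean=30, stddev=5))
```

The design choice illustrated here is that resubmission consumes no extra resources unless a fault actually occurs, so it is preferred whenever the deadline leaves room for a second attempt.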
The rest of this paper is organized as follows. Section 2 briefly reviews the related work. Section 3 presents an overview of our target system and the design problem. In Section 4, we discuss the scheduling strategy and the optimization of resource utilization, based on which, the DEFT scheduling algorithm is developed in Section 5. The experimental evaluation is presented in Section 6. The conclusions and future work are given in Section 7.
Related work
Scheduling is regarded as an important way to take advantage of cloud computing, so many papers have studied task scheduling in the cloud [12], [15], [23]. In [27], Xiao et al. proposed a cost-aware scheduling method for big data processing. In [31], Zhang et al. proposed a deep learning model for predicting cloud workload, which can improve task scheduling efficiency. Moreover, in [30], they proposed a deep computation scheduling method for the industrial IoT
System framework
Fig. 1 shows the overview of our target system. The data center consists of a set of hosts and each host can provide a number of computing instances (namely, virtual machines). The task flows sent by the users are queued and waiting to be dispatched to the computing instances of the data center. The system scheduler consists of a task scheduler and a performance monitor. The monitor provides the system performance status. The task scheduler schedules tasks according to the feedback from the
Task scheduling and resource allocation
The key issue of fault tolerance is to assign tasks to appropriate virtual machines while ensuring that resource allocations meet the fault-tolerance constraints. Since the failure of one host causes all VMs on it to fail, the primary and the backup of a task should be allocated to VMs on different hosts when the replication method is adopted. For the resubmission method, if the initial task is aborted due to the host failure, the re-submitted one should be scheduled
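The host-separation constraint for replication described in this section can be sketched as a simple filter over candidate VMs. This is a minimal illustration under assumed data structures: the `VM` class, its fields, and `backup_candidates` are hypothetical names, not part of the paper's model.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VM:
    vm_id: str    # hypothetical identifier of the virtual machine
    host_id: str  # hypothetical identifier of the physical host it runs on


def backup_candidates(primary: VM, vms: List[VM]) -> List[VM]:
    """Return the VMs that may hold the backup of a task whose primary runs
    on `primary`: only VMs on a different host survive a crash of that host."""
    return [vm for vm in vms if vm.host_id != primary.host_id]


if __name__ == "__main__":
    fleet = [VM("vm1", "hostA"), VM("vm2", "hostA"), VM("vm3", "hostB")]
    print(backup_candidates(fleet[0], fleet))
```

Note that a sibling VM on the primary's host (`vm2` in this example) is excluded even though it is a different VM, because a single host crash would take down both copies.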
Dynamic Fault-Tolerant Elastic scheduling algorithm – DEFT
In this section, based on the fault-tolerant mechanisms discussed above, we design a Dynamic Fault-Tolerant Elastic scheduling algorithm, DEFT, for real-time task scheduling that takes into account the volatility of system performance. DEFT uses heuristic approaches to optimize resource utilization while fault tolerance is guaranteed. It consists of three major parts: scheduling strategy selection, resource allocation for primaries, and resource allocation for backups. For
Performance evaluation
To verify the performance of DEFT, we conduct experiments on CloudSim using the Google trace logs. We quantitatively compare DEFT with three baseline algorithms: Non Dynamic Fault Tolerance by Replication (NDRFT), Dynamic Fault Tolerance by Replication (DRFT), and Non Weak Fault Tolerance of DEFT (NWDEFT). The three baseline algorithms are briefly described as follows:
- NDRFT is derived from the replication method for fault tolerance, which has been widely accepted in academia.
Conclusions and future work
In this paper, we focus on the fault tolerance when the uncertainty of task runtime is considered. We propose an efficient fault-tolerant scheduling algorithm DEFT, which extends the traditional fault tolerance model. Meanwhile, we apply the overlapping mechanism to the fault tolerance model, which can effectively achieve the tradeoff between reliability and resource optimization in cloud. Through modeling the task execution time with uncertainty and comprehensive employment of replication and
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grants 61872378, 61572511, and 91648204, in part by the Science Fund for Distinguished Young Scholars in Hunan Province under Grant 2018JJ1032, and in part by the China Postdoctoral Science Foundation under Grants 2016M602960 and 2017T100796.
References (35)
- et al., Efficient overloading techniques for primary-backup scheduling in real-time systems, J. Parallel Distrib. Comput. (2004)
- et al., Energy-aware resource allocation heuristics for efficient management of data centers for cloud computing, Future Gener. Comput. Syst. (2012)
- et al., Towards energy-efficient scheduling for real-time tasks under uncertain cloud computing environment, J. Syst. Softw. (2015)
- Failure-aware resource management for high-availability computing clusters with distributed virtual machines, J. Parallel Distrib. Comput. (2010)
- et al., Fault-tolerance through scheduling of aperiodic tasks in hard real-time multiprocessor systems, IEEE Trans. Parallel Distrib. Syst. (1997)
- et al., QoS guarantees and service differentiation for dynamic cloud applications, IEEE Trans. Netw. Serv. Manag. (2013)
- et al., Heuristic scheduling strategies for linear-dependent and independent jobs on heterogeneous grids, International Conference on Grid and Distributed Computing, Gdc 2011, Held As (2011)
- et al., SLA-based admission control for a software-as-a-service provider in cloud computing environments, J. Comput. Syst. Sci. (2012)
- et al., Cost-aware big data processing across geo-distributed datacenters, IEEE Trans. Parallel Distrib. Syst. (2017)
- et al., A hybrid chemical reaction optimization scheme for task scheduling on heterogeneous computing systems, IEEE Trans. Parallel Distrib. Syst. (2015)
- Task scheduling algorithm with fault tolerance for cloud, International Conference on Computing Sciences
- Uncertainty-aware online scheduling for real-time workflows in cloud service environment, IEEE Trans. Serv. Comput.
- Designs, lessons and advice from building large distributed systems, LADIS
- Fault-tolerant elastic scheduling algorithm for workflow in cloud systems, Inf. Sci.
- Design and implementation of an efficient two-level scheduler for cloud computing environment, IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
- Heuristic offloading of concurrent tasks for computation-intensive applications in mobile cloud computing, Computer Communications Workshops
- Hybrid genetic algorithms for scheduling partially ordered tasks in a multi-processor environment, International Conference on Real-Time Computing Systems and Applications