1 Introduction

Over the last decade, the use of cloud computing has grown substantially. An increasing number of facilities are incorporated into the cloud environment and made accessible to users worldwide. Likewise, cloud computing companies such as IBM, Yahoo, Amazon, and Google provide global access to their services [1]. Moreover, these are metered services, commonly termed subscriptions, and are frequently offered through the Software as a Service (SaaS) delivery model [2].

The cloud environment consists of two components, i.e., the frontend and the backend. The frontend is the main interface on the consumer side and is accessed through different networks over the internet [3]. The backend belongs to the CSP (Cloud Service Provider) and provides services by utilizing data center resources. These data centers house physical machines known as servers. Multiple virtual copies of these physical machines can be created through virtualization. Virtualization enables the handling of multiple incoming requests for a particular application or service from across the globe. The shareable resources include applications, software, hardware, etc.

In cloud architecture, there are mainly three service models [4]: Infrastructure as a Service (IaaS), Software as a Service (SaaS), and Platform as a Service (PaaS) [5, 6]. Faults may arise in any of these three layers while user services are being delivered. Therefore, the detection and removal of faults is necessary for obtaining the best possible reliability, as presented in [7, 8]. Moreover, deficiencies in the cloud infrastructure have a direct impact on resource reliability and availability [4]. These deficiencies need to be critically analyzed and treated to improve reliability and robustness. The Deep Neural Network (DNN), a powerful deep learning tool, is a promising solution for this [9]. Fault tolerance is a significant capability that can detect, locate, and recover from faults and failures in the cloud environment, making the cloud more robust and enhancing its efficiency [10]. Fault tolerance mainly falls into two sub-areas, i.e., hardware fault tolerance and software fault tolerance [11].

On the other hand, scheduling tasks appropriately is vital in delivering critical and essential cloud services. Ineffective task scheduling increases task execution time and waiting time. Besides, inadequate load balancing results in under- and over-utilization of resources: under-utilization wastes resources, while over-utilization degrades the performance of cloud systems. Hence, proficient load distribution is essential to boost the performance of cloud-based applications.

There is a fundamental need to incorporate load balancing and scheduling into efficient fault-handling mechanisms due to the architectural challenges of cloud systems. Therefore, this paper conducts a hybrid review covering fault tolerance together with scheduling, load balancing, and the analysis of QoS (Quality of Service) parameter optimization. This comprehensive review primarily centers on three core classifications of fault tolerance techniques, namely reactive, proactive, and resilient approaches. The reactive procedures are the conventional fault tolerance techniques and include replication, detection, checkpointing/restarting, and recovery. In the proactive methods, the system is prevented from reaching a defective state through monitoring, prediction, and pre-emption; actions are taken to minimize the defects, and thereby the failure condition is avoided. The resilient methods have shown a recent take-off in the literature and indicate a potential trajectory for the future of fault tolerance in cloud environments, because these methods are grounded in artificial intelligence and machine learning (ML) [10]. Besides, simulation toolkits play a critical role in evaluating cloud computing settings. These toolkits allow us to simulate and evaluate cloud set-ups cost-efficiently without the need for massive infrastructure. Some of the most effective and powerful simulators are discussed in [12]. A comparative analysis among various simulators concerning various parameters was performed in [13] to determine the features and functions of each toolkit.

1.1 Research methodology and data analysis

This section describes the methods used to perform the qualitative assessment of the literature in this review and the sources of the considered state-of-the-art works. It also presents the methodology adopted for the proposed research. Finally, we specify our significant contributions in this review.

The selection and elimination of published articles were determined based on several criteria. Related articles were shortlisted after analyzing their abstracts, and a critical review/analysis was then performed. Papers were selected based on the quality of both the database and the article itself. Furthermore, inclusion was based on the following conditions.

  a. Searching strategy

    A systematic survey of fault tolerance with efficient scheduling and load distribution techniques proposed in the literature was conducted through well-known sources.

    Several search keywords were used in this study, including Cloud Resources, Fault Tolerance, Task Scheduling, Load Balancing, QoS Parameters, Resource Optimization, failure in the cloud, essential cloud services, cloud architecture, scheduling techniques, etc.

  b. Duration and validity of the study
    • This review mostly incorporates articles from 2009 to 2023 from well-reputed journals, books, and conferences.

    • The statistics of the considered publications from 2009 to 2023 are depicted in Fig. 1.

    • Very few studies are included from 2007 and 2008.

    • The selected duration was chosen to capture a comprehensive range of data, such as technical progress, economic cycles, and policy variations, and to confirm the availability of data pertinent to our study that reflects the evolution, progression, and trends applicable to our study objectives.

    Fig. 1. Percentage of the included papers (2009 to 2023)

  c. Language and selection/inclusion criteria
    • The language criterion was restricted to English, because English is the primary language of scholarly publications, particularly in the fields of computer science and distributed computing. Limiting the criteria to English-language articles ensured that we selected high-quality and broadly recognized studies, facilitating a thorough and appropriate review.

    • The primary priority was given to hybrid fault tolerance approaches including either scheduling or load balancing.

    • Hybrid fault tolerance approaches that also optimize other QoS parameters were considered. Figure 2 presents the detailed inclusion and exclusion of the studies.

    Fig. 2. Methodology of the inclusion and exclusion criteria of the studies

  d. Data processing and analysis
    • The data was initially organized into Excel and prepared for analysis.

    • Data categorization was made based on different QoS parameters, the environment used for simulation, types of faults considered, and other thematic considerations. This categorization helps us to analyze the literature more clearly and precisely.

    • The qualitative information was obtained by considering diverse Quality of Service (QoS) metrics, types of faults addressed, and the range of simulation environments utilized across a timeframe.

    • Furthermore, the analysis also highlights the various fault tolerance methods employed in the existing literature.

  e. Synthesis of the analysis
    • For meaningful conclusions and insights, the data was observed based on the objectives of the study.

    • The patterns and relationships among the various studies were discussed for comparison and assessment.

  f. Quality assessment and validation procedure

    The methodology adopted for this study can be summarized in four stages:

    • Initially, related articles were searched using the relevant keywords.

    • Some articles were selected based on title, standards, and optimization parameters.

    • The abstracts of the selected articles were examined, and further inclusion and exclusion were performed.

    • Finally, the included articles were extensively reviewed, analyzed, and incorporated into this survey.

1.2 Motivation

Faults can lead to malfunctions that worsen a system's overall performance. Failures result in the breakdown or shutdown of a system, but occasionally, flaws cause performance to decline rather than a complete shutdown. Various fault tolerance solutions can be employed to address different types of defects, such as network, physical, and process problems. However, fault tolerance is hard to achieve without comprehending where the issue exists inside the architecture and the damage that the system flaw produces. The cloud is composed of layers, each of which takes services from the layers below it. A failure at any layer has the potential to contaminate the layer directly above it, since faults at one layer may affect the services that the other layers provide. Thus, for high-performance computing, an appropriate fault tolerance system is needed to handle these faults effectively. Faults should be managed critically and dynamically to make the cloud environment more efficient and intelligent. Besides, in the cloud, efficient task scheduling leads to maximum utilization of virtual machines and reduced operational costs, thereby improving the QoS parameters and eventually the overall performance. Moreover, load-balancing techniques need to be examined comprehensively across various settings, including static, dynamic, and nature-inspired cloud environments.

Various methods have been suggested in the academic literature to address this concern, and multifarious reviews are available for future researchers. While studying the existing surveys, it was observed that they are not sufficiently thorough or wide-ranging in certain respects. Although the authors in [14] presented a comprehensive survey on fault tolerance, it does not focus on other aspects of the cloud such as efficient load balancing and scheduling. In [15], an extensive survey focused on scheduling but lacked emphasis on fault handling and load distribution. Besides, [16] presented a vast survey focusing on load balancing across cloud resources but lacking fault handling and cloud optimization. Similarly, [17] also provides a survey emphasizing fault tolerance frameworks; however, it does not address enhancing the performance of the cloud environment. The work in [18] considers only fault-tolerant approaches and does not give prominence to major cloud aspects such as scheduling and load balancing. Similarly, the most recent survey presented in [19] focused on both scheduling and fault tolerance but offered no means for optimal load distribution. Additionally, the observations presented in [20] were limited to a few aspects of fault-handling techniques, only crash and Byzantine fault models were considered, and QoS parameters were not taken into account. Similarly, the recent survey presented in [21] was found to be limited to reliability. In other words, these reviews did not significantly focus on the discussed issues of the cloud related to fault tolerance combined with scheduling/load balancing. After this comprehensive analysis, it was observed that none of the mentioned surveys offer extensive consideration of the above-mentioned scenarios of cloud computing. The QoS and other important aspects related to the cloud's fault tolerance concerns are addressed by researchers in the existing surveys, but only to a very limited extent. This renders the existing reviews inadequate for analyzing the current state of the art in cloud systems. Hence, there is a dire need for a survey focusing on the reliability-related aspects of the cloud, which motivated us to present this systematic and hybrid review. In this survey, we explore the landscape of hybrid fault tolerance models that focus not only on traditional fault tolerance techniques but also integrate other important cloud aspects such as scheduling and load balancing. This integration helps us to highlight the likely applications, challenges, and emerging trends.

Hybrid models that combine fault tolerance with load balancing and scheduling offer several advantages over single-scheme approaches.

  • Illustrative example:

  • Consider a scenario where a CSP hosts several services and applications for its clients, utilizing solely fault tolerance mechanisms (single-model schemes). Fault tolerance frequently results in redistributing the workload from faulty virtual machines (VMs) to the unaffected VMs. This redistribution often upsets the load equilibrium between VMs, which leads to an unequal workload distribution and a deterioration in overall service performance. However, if the CSP implements a hybrid model that integrates multiple reliability measures, it can enhance reliability and provide robust services to the clients. In our example, if the CSP employs a hybrid model that performs load balancing after the fault-tolerance measures, it can simultaneously minimize the risks of non-uniform load distribution and other overheads associated with fault tolerance, and improve the QoS (a toy code sketch of this sequence is given after this list).

  • Besides, to make this emerging domain more accessible for future researchers, there is a need to analyze up-to-date methods concerning these factors [10]. This review is also inspired by peer surveys of the existing literature along with their limitations. Moreover, it presents an analysis of some important aspects of the existing literature, such as QoS, static/dynamic behavior, the environmental setup used, fault tolerance approaches, and fault models, and presents the results in graphical form. The analysis provided offers a comprehensive perspective on the research efforts that have been the focal point of existing studies. An overall comparison of the top-cited surveys with the proposed survey is also illustrated in the subsequent sections.
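
To make the illustrative example above concrete, the following sketch shows a hybrid sequence in which fault tolerance (migrating work off a failed VM) is followed by a rebalancing pass. This is an illustrative toy model only; the load metric, threshold, and VM names are assumptions and not taken from any surveyed scheme.

```python
# Toy sketch of a hybrid fault-tolerance + load-balancing step.
# Loads are abstract "work units" per VM; values are purely illustrative.

def handle_vm_failure(loads, failed_vm):
    """Reactive step: move the failed VM's work to the least-loaded healthy VM."""
    work = loads.pop(failed_vm)
    target = min(loads, key=loads.get)
    loads[target] += work
    return loads

def rebalance(loads, tolerance=1.2):
    """Hybrid step: shift work from the most- to the least-loaded VM until the
    ratio between them falls under the tolerance."""
    while max(loads.values()) > tolerance * min(loads.values()):
        busiest = max(loads, key=loads.get)
        idlest = min(loads, key=loads.get)
        shift = (loads[busiest] - loads[idlest]) / 2
        loads[busiest] -= shift
        loads[idlest] += shift
    return loads

loads = {"vm1": 40, "vm2": 35, "vm3": 10, "vm4": 55}
loads = handle_vm_failure(loads, "vm4")   # fault tolerance alone skews the load
print(loads)                              # {'vm1': 40, 'vm2': 35, 'vm3': 65}
print(rebalance(loads))                   # load is spread out again
```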

1.3 Our contribution and features of the study

The primary contributions of this survey include:

  • This article presents an in-depth examination of the cloud environment. The main faults and fault taxonomy in cloud systems are also discussed in detail.

  • Various researchers have already addressed fault tolerance and load balancing mechanisms; however, much of their work has focused on employing either fault tolerance or load balancing separately. The presented survey incorporates a review of fault tolerance with two other related aspects, i.e., load balancing and scheduling, which is a pressing need of the time and was found missing in current surveys.

  • Moreover, Tables 1 and 2 present a comparative analysis of our contribution with the recent and current top-cited studies respectively.

  • The survey has been presented in two categories i.e., Fault tolerance with Scheduling and Fault tolerance with Load balancing.

  • The generalized problem formulation of fault tolerance has also been presented to understand the workings of fault tolerance using the replication technique.

  • We further outlined the difficulties associated with ensuring fault tolerance integrated with scheduling and load balancing in cloud computing systems and provided a thorough examination of the common problems faced. This will assist future researchers in promptly recognizing and understanding the problems related to this area.

  • The study also presents useful graphical observations about the literature, such as the parameters optimized, the fault models addressed, the environment/tool used, etc. These detailed observations are presented separately for both categories and, to the best of our knowledge, were not found in the existing surveys. A dedicated discussion and observation section is designed for that purpose.

  • This hybrid review aids in investigating the potential challenges of hybrid fault-tolerant models and provides a detailed roadmap for future research directions. The aim is to enhance migration methods, thereby mitigating failures among nodes.

  • Moreover, the overall study provides a platform for future researchers to analyze the current state of the art regarding considered issues and find the appropriate future research problems.

  • In the end, there is a dedicated section highlighting the future research directions of the problem.

Table 1 Comparative analysis related to the contribution of the top-cited study and the proposed study
Table 2 Enlightenment of reactive fault-tolerant techniques

1.4 Organization of the paper

This research review article adheres to the following structure:

Section 1 presents the detailed introduction of the study, with subsections 1.1 to 1.7. These subsections cover the research methodology and data analysis, the motivation (with an illustrative example elucidating how hybrid frameworks can benefit CSPs), and the authors' contributions. Section 1 also focuses on the significance of fault tolerance in the cloud, encompassing a taxonomy of faults, errors, and failures, along with the challenges associated with fault tolerance, in dedicated subsections. It further delineates the specifics of scheduling, load balancing, and fault tolerance in the pursuit of reliable cloud services, and formulates the problem associated with fault tolerance in this context. The detailed survey of the literature with a comparative analysis is elaborated in Section 2. Section 3 presents the discussions and observations from the reviewed literature, with the overall analysis of fault tolerance with both scheduling and load balancing in dedicated sections. Open issues, future directions, and a methodical roadmap for open challenges are highlighted in Section 4. Finally, Section 5 concludes the whole study. The organization of the presented study is shown in Fig. 3.

Fig. 3. Organization of the study

1.5 Fault tolerance in cloud computing

Faults in any resource may affect the task execution time and the QoS parameters of the cloud, which eventually reduces the performance of the system. An efficient fault tolerance policy helps to identify and overcome errors in the cloud architecture, thereby boosting the performance metrics. The fault tolerance capability should be considered together with other techniques such as scheduling and load balancing for the effective performance of the system. Moreover, the load balancing and scheduling approaches should perform their respective functions alongside fault tolerance. In case of a crash or connection error, the system should be capable of providing an alternative VM to handle these failures for smooth and uninterrupted task execution, because crashes in any node will affect the efficiency of the entire system. Therefore, handling faults enhances the ability of the system to accomplish tasks precisely and accurately while resolving the occurrence of internal defects [38]. The inclusion of fault tolerance with other reliability-related techniques, such as scheduling and load balancing, will make the cloud environment more efficient, specifically for the real-time and dynamic processing of tasks [39]. Hence, fault tolerance is a major aspect that ensures robustness, reliability, and other performance metrics in the cloud environment [40, 41].

1.5.1 Fault, error, and failure taxonomies

A fault is the condition of the system in which it loses the ability to produce the expected output due to an unexpected condition or defect in any of its internal or external components. The main faults within the cloud environment are enumerated as follows [42]:

  • The Network Faults: These defects arise due to network interruption in any connection, nodes, cluster, etc., [43, 44].

  • The Physical Faults: When any of the hardware resources like CPU, memory, storage, etc., fails, these types of faults will occur. The power failure also gives rise to these types of faults [42].

  • The Process Faults: These are the common faults in a cloud environment that occur because of the unavailability of any resource, software, etc., [43].

  • The Service Expiry Fault: This type of fault arises if the service clock of the resource runs out while the application is in use [43].

  • The Media Fault: Any crash in the media of the cloud will lead to these types of faults [39].

  • The Processor Faults: This type of fault mainly occurs because of malfunctioning in the operating system [45].

  • The Restrictions Faults: This type of fault occurs when any fault arises and is unnoticed or ignored by the controlling or any other responsible agent [17].

  • The Parametric Faults: This type of fault occurs when the optimization parameters are ambiguous or remain undefined [17].

  • The Time Restriction Faults: These faults occur when the particular application is not completed by the predefined deadline [17].

  • The fault tolerance mechanism makes the cloud environment efficient by providing necessary services even in case of failure of one or multiple components [46, 47]. If there is any kind of fault in the system, it leads to error, and error, in turn, culminates in failure.

  • Fault: The abnormal state of a system in which assigned tasks cannot be performed. Usually, the fundamental cause of this state is the presence of bugs in one or multiple components of the system [26, 29, 30, 48, 49, 50]. Faults are categorized into various groups, as depicted in Fig. 4.

  • Error: A system experiencing faults may transition into an error state. Compromised performance due to errors can subsequently result in incomplete or complete failure of the system. Errors have been classified into the following categories, as shown in Fig. 5.

  • Failure: The presence of an error can take the system to the failure state, which has a direct effect on the user. Moreover, a failure is recognized by the user upon observing incorrect output from the system [25, 26, 30]. Failures have been classified into the following categories, as exhibited in Fig. 6.

Fig. 4. Different fault categories

Fig. 5. Different error categories

Fig. 6. Different failure categories

1.6 General fault-tolerance challenges in cloud computing

Ensuring a fault-tolerant cloud environment involves evaluating numerous challenges. Some of these challenges are discussed below:

  • Task and failure heterogeneity: The cloud utilizes different hardware and operating systems simultaneously and builds on underlying heterogeneous frameworks [51]. As a result, heterogeneous types of faults must be handled, which increases the complexity of overcoming them.

  • Automation: The use of VMs in the cloud environment is growing exponentially, and managing these platforms in real time is becoming more difficult. Therefore, there is a strong need to automate fault tolerance strategies for complex networks [15].

  • Cloud halts: The main goal of fault tolerance is to provide uninterrupted service even in the case of a service interruption or a malfunction of a host server or network system. The Service Level Agreements [26] of all companies should be prepared accordingly.

  • Retrieval Point and Recovery Time Objective targeting: The retrieval (recovery) point is established to preserve the set of records that may be at risk of loss in the event of a server error [14]. The recovery time, on the other hand, is the time required for the process to get back up and running after a failure [52]. The main aim is to reduce the RPO (Retrieval Point Objective) and RTO (Recovery Time Objective) to the minimum possible values [10].

  • Cloud Workload: Cloud workloads are the specific application-related tasks/services or specific amounts of work executed on a cloud resource. Workloads can be of two types, i.e., enabled and native loads. Native workloads are also labeled as “born on the web” and are entirely cloud-developed applications. An enabled workload, on the other hand, pertains to the computational tasks generated by cloud-enabled applications. Moreover, proactive and resilient approaches seem relevant [53] for fulfilling the fault tolerance requirements of both enabled and native workloads [10].

1.7 Measures for effective cloud reliability- a need for the hybrid framework

The demand for the cloud computing paradigm has grown intensely in the past few years, as it allows the dynamic acquisition and release of computing resources in a device-independent and cost-effective manner with little effort or interaction from the service provider. Despite many enhancements, the cloud is still prone to system failures, which results in growing apprehension regarding the reliability of public cloud services. Reliability is a measure of the effectiveness of the system; its value can be adjusted after each computation, with the default reliability being 100% [54]. The reliability conditions must be met for stable and efficient processing in the cloud. It is also one of the critical Quality of Service constraints. Moreover, optimized QoS parameters play an important role in effective and adequate resource allocation and have been extensively investigated in cloud computing. These parameters are used to evaluate the efficiency of various scheduling, load balancing, or fault tolerance techniques in the cloud.
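
As an illustration of how such a reliability value can be adjusted at run time (a generic, hedged sketch; the adjustment factors α and β are assumptions and are not taken from [54]), a simple adaptive scheme starts every VM at R₀ = 1 (i.e., 100%), increases its reliability after each successful execution, and decreases it after a failure:

$$R_{k+1}=\begin{cases}R_{k}+\alpha \left(1-R_{k}\right), & \text{if the } k\text{-th execution succeeds}\\ \beta R_{k}, & \text{if the } k\text{-th execution fails}\end{cases}\qquad 0<\alpha ,\beta <1$$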

1.7.1 Cloud scheduling approach

Cloud scheduling is performed by mapping each incoming task to the most suitable available VM. Determining the sequence in which events or tasks should be executed in the cloud, while simultaneously analyzing the required QoS parameters, is termed scheduling. Cloud scheduling mainly includes the following:

  • Prediction of future incoming workloads and normalization of the QoS parameters.

  • Selection of the most suitable VM and execution of the particular task via heuristic/meta-heuristic algorithms.

Generally, the VM/task scheduling is done in two ways:

  • On-Demand Scheduling: This scheduling considers dynamic cloud workloads on demand, and VMs are provided quickly by cloud service providers as required. However, it may lead to the problem of workload dispersal; in other words, multiple tasks may be processed by a single VM at a time (an under-provisioning problem), degrading the performance of the system.

  • Long-Term Reservation: This scheduling reserves resources for the long term. However, providing many VMs can lead to over-provisioning problems in some situations.

These under- and over-provisioning problems may cause wastage of VMs and increased task execution time, and thereby the overall cost of services may increase. Hence, a well-organized and effective provisioning technique that examines and schedules cloud workloads efficiently is essential. Figure 7 explains the process of VM Provisioning and Scheduling (VPS) [55].

Fig. 7. VM provisioning and scheduling (VPS)

The main aims of VM provisioning are:

  • Fulfill the User’s demand without SLA violation.

  • Prior prediction of user requirements based on incoming workload size.

In cloud provisioning, the SLA is settled between the end users and the Cloud Service Provider after fully analyzing the incoming workloads. Before scheduling (mapping) the incoming workload (applications/tasks) onto particular VMs/resources, the running VMs are monitored regularly for load estimation [56]. If a VM is found to be overutilized, that particular VM is temporarily disabled for future assignments and is not allocated immediately after mapping. Afterward, the task-executing capability of the VM is also tested before any further allocation. This study also contains a review of various research papers focusing on the principles of load balancing and scheduling. In the cloud, efficient scheduling of jobs is the main factor ensuring high-performance applications. However, cloud scheduling not only has to deal with the dynamism and the widespread nature of the cloud, but it should also consider the optimization of other important parameters. Matching tasks to the corresponding machines and scheduling the order of execution of these tasks is referred to as mapping. Efficient mapping minimizes the total execution time of the meta-task. A meta-task is defined as a collection of independent tasks with no inter-task dependencies. The mapping of such meta-tasks is performed statically (i.e., offline or in a predictive manner). The general problem of optimally mapping tasks to machines is NP-complete [57]. Task scheduling [58] is the fundamental step of VM management in the cloud. Task scheduling can be of two types: static and dynamic scheduling. A minimal sketch of a greedy mapping heuristic of this kind is given below.
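
The following Python sketch illustrates a simple minimum-completion-time (MCT) style greedy mapping of an independent meta-task set onto VMs. It is only an illustrative heuristic, not the exact algorithm of any surveyed work; task lengths in million instructions (MI) and VM speeds in MIPS are assumed units.

```python
# Minimal sketch: greedy minimum-completion-time (MCT) mapping of a meta-task.
# Assumptions: task lengths in million instructions (MI), VM speeds in MIPS.

def mct_schedule(task_lengths, vm_speeds):
    """Map each independent task to the VM that can finish it earliest."""
    ready_time = [0.0] * len(vm_speeds)        # when each VM becomes free
    mapping = []                               # (task index, vm index, finish time)
    for i, length in enumerate(task_lengths):
        # expected completion time of task i on every VM
        finish = [ready_time[j] + length / vm_speeds[j] for j in range(len(vm_speeds))]
        best_vm = min(range(len(vm_speeds)), key=lambda j: finish[j])
        ready_time[best_vm] = finish[best_vm]
        mapping.append((i, best_vm, finish[best_vm]))
    makespan = max(ready_time)                 # total execution time of the meta-task
    return mapping, makespan

if __name__ == "__main__":
    tasks = [4000, 1200, 2600, 800, 5400]      # MI
    vms = [1000, 500, 2000]                    # MIPS
    plan, makespan = mct_schedule(tasks, vms)
    for t, v, f in plan:
        print(f"task {t} -> VM {v}, finishes at {f:.2f} s")
    print(f"makespan: {makespan:.2f} s")
```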

1.7.2 Load balancing approaches

Load balancing is among the chief requirements of a cloud environment. Load balancing usually shifts load from a highly loaded VM to a minimally loaded VM to ensure the uniform dispersal of load among VMs. It aims to share the workload among computational resources to maintain load equilibrium and allow each resource to function within its designated efficiency threshold. The uneven distribution of load among VMs affects the response time, interaction overhead, throughput, and resource utilization of the system [31]. Furthermore, load balancing improves VM availability and maintains reliability. Besides, the load can be balanced by implementing resource redundancy, which supports scalability. Numerous strategies have been proposed by researchers to attain optimal load balancing. Some of the advantages that inspire the implementation of load balancing in the cloud are as follows:

  • Efficient VM Utilization in a Cloud Environment: In the cloud, VMs may be loaded unevenly, which affects the overall performance of the system. A frequently selected VM can be highly utilized while other VMs remain idle throughout the process, with the underutilized VMs waiting for tasks. This scenario results in higher processing time and increased waiting time. To overcome such inconsistencies, VM utilization needs to be made efficient by optimally balancing the load among resources.

  • Adequate Load Distribution: Adequate load distribution is necessary to attain the best possible performance of the system. It makes use of the maximum computing capability of each VM and enables parallel task execution. Likewise, it ensures that an adequate load is allocated to every single VM according to its capacity in all conditions. It is necessary to distribute the workload among all VMs uniformly, according to their processing capacities, to reduce the task execution time to the lowest possible value.

  • Minimization of Response Time: Inappropriate load distribution leads to several disparities resulting in higher response time which eventually results in an inconsistent state of the system. Thus, it is crucial to realize optimal load balancing to minimize the response time and achieve enhanced system throughput.

Besides, in the cloud, VMs can work independently or collectively as per the requirement and nature of the task. Each VM is capable of processing workload according to its processing capabilities. The prime target of load balancing is to achieve a balanced distribution of workloads among the available VMs. Typically, load-balancing algorithms comprise two elementary policies, i.e., the transfer policy and the location policy [59]. The transfer policy identifies whether a VM is overloaded or not; the dynamic aspects of the system are also addressed by this policy. The transfer policy also decides whether load migration needs to be initiated. Based on workload information, this policy determines when a node should act as a sender (i.e., transfer a task to another VM) and when it should act as a receiver (i.e., retrieve a task from another VM).

The location policy, in turn, selects a suitable under-loaded or over-loaded partner VM. It locates corresponding VMs and allows them to send or receive workload between them to improve the overall performance of the system. These policies are further categorized as receiver-initiated, sender-initiated, or symmetrically initiated. The location policy chooses an alternative VM for task migration: if a VM is identified as a qualified receiver, the policy searches for a qualified sender VM, and vice versa. Once a virtual machine becomes eligible as a sender or receiver, a selection policy determines which job in the queue should be moved first [31]. A minimal sketch of the transfer and location policies is given at the end of this subsection. Based on the information and implementation used by these two policies, load balancing mechanisms are classified as follows [60]:

  • Static Load Balancing

  • Dynamic Load Balancing

  • Adaptive Load Balancing

  • Periodic Load Balancing

  • Non-Periodic Load Balancing

  • Advance Load Balancing

Generally speaking, load-balancing algorithms can also be categorized as hierarchical, decentralized, or centralized, depending on where migration decisions are made [61, 62].
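
The following Python sketch illustrates the transfer and location policies described above using simple utilization thresholds. It is an illustrative toy model; the thresholds, load metric, and sender-initiated pairing rule are assumptions, not taken from any surveyed scheme.

```python
# Minimal sketch of threshold-based transfer and location policies.
# Load is modeled as CPU utilization in [0, 1]; thresholds are illustrative.

OVERLOAD_THRESHOLD = 0.80
UNDERLOAD_THRESHOLD = 0.20

def transfer_policy(vm_load):
    """Decide whether a VM should act as a sender, a receiver, or stay as-is."""
    if vm_load > OVERLOAD_THRESHOLD:
        return "sender"        # too loaded: should migrate a task away
    if vm_load < UNDERLOAD_THRESHOLD:
        return "receiver"      # lightly loaded: can accept extra work
    return "balanced"

def location_policy(roles):
    """Pair every sender with a receiver, busiest sender with idlest receiver."""
    senders = sorted((v for v in roles if roles[v][0] == "sender"),
                     key=lambda v: -roles[v][1])
    receivers = sorted((v for v in roles if roles[v][0] == "receiver"),
                       key=lambda v: roles[v][1])
    return list(zip(senders, receivers))       # (from_vm, to_vm) migration pairs

if __name__ == "__main__":
    loads = {"vm1": 0.92, "vm2": 0.10, "vm3": 0.55, "vm4": 0.85, "vm5": 0.05}
    roles = {vm: (transfer_policy(load), load) for vm, load in loads.items()}
    print(location_policy(roles))   # [('vm1', 'vm5'), ('vm4', 'vm2')]
```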

1.7.3 Fault-tolerant approaches

The cloud is a dynamic system that hosts several dispersed, heterogeneous resources (VMs) and completes millions of user tasks. Nevertheless, these VMs have the flexibility to join or exit the system at any given time. Thus, achieving fault tolerance is a critical issue in such dynamic systems [63]. Additionally, the implementation of a fault-tolerant system also leads to the optimization of various QoS parameters and cloud characteristics, so significant benefits can be attained. It also assures on-time task execution in the case of unexpected scenarios such as failures, resource disconnection from the system, task migration, or any other unanticipated user operation. Moreover, while numerous previous studies have tackled fault tolerance and task allocation, only a limited number have examined issues at the processor level. In the recent literature, a handful of works have delved into extensive research on scheduling and load balancing while incorporating fault tolerance [17]. The concept of abstraction is split into different layers, i.e., the Infrastructure as a Service, Platform as a Service, and Software as a Service layers. There is a need to implement appropriate fault tolerance techniques for fault diagnosis to determine the various faults at these service levels. This research article includes various fault diagnosis methods corresponding to these service layers, along with the fault categories. Defects in any layer can have an impact on the layer above it because of the layer interrelationships [17].

Moreover, to reach higher levels of robustness in cloud computing, failures need to be assessed and handled effectively [26, 29, 48, 64]. Extensive work has been proposed in the literature to make the cloud fault-proof. The approaches proposed in the literature can be categorized as shown in Fig. 8.

Fig. 8. Categories of fault tolerance techniques under different approaches

Reactive fault tolerance

Once a defect has occurred, reactive fault tolerance is applied. Using this approach, we can decrease the impact of the fault in the cloud and thereby increase the system's robustness and reliability [46, 48]. The focus is on recovering the system in case of a failure inside it [10]. Furthermore, data replication and data transfer are used for restoration [65]. These approaches address Byzantine faults, crash faults, hardware faults, and host failures. Different fault-tolerant techniques that utilize a reactive approach are presented in Table 2.
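
As a small illustration of the checkpoint/restart idea listed among these reactive techniques (a generic, simplified sketch; the checkpoint interval, file name, and state format are assumptions, not tied to any surveyed scheme):

```python
import pickle

def run_with_checkpoints(work_items, state_file="task_state.pkl", every=100):
    """Process work items, saving progress periodically so that execution can
    resume from the last checkpoint instead of restarting after a crash."""
    try:
        with open(state_file, "rb") as f:
            start, partial_sum = pickle.load(f)        # resume from checkpoint
    except FileNotFoundError:
        start, partial_sum = 0, 0                       # fresh run
    for i in range(start, len(work_items)):
        partial_sum += work_items[i]                    # the "work" being done
        if (i + 1) % every == 0:
            with open(state_file, "wb") as f:
                pickle.dump((i + 1, partial_sum), f)    # checkpoint progress
    return partial_sum

if __name__ == "__main__":
    print(run_with_checkpoints(list(range(1000))))      # 499500
```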

Proactive fault-tolerance

This strategy provides pre-planned alternative solutions for handling faults; fault prediction is therefore proactive. Moreover, the faulty component is substituted with an alternative component at runtime to avoid having to recover from errors and faults [4, 46, 47, 66]. This approach provides cost effectiveness with maximum efficiency and reliability of the system [27] and addresses software and parametric faults. Some of the proactive fault-tolerant techniques proposed in the literature are listed in Table 3.

Table 3 Enlightenment of proactive fault-tolerant techniques

Resilient fault-tolerance

These techniques have some similarities with the proactive approach. The defects are forecasted, and their effects are prevented or moderated by applying certain methodologies. The forecasting utilizes intelligent learning, which distinguishes resilient techniques from proactive ones. These approaches are adopted for general faults. In this strategy, the system is continuously monitored for faults, which makes it an adaptive form of fault tolerance [10]. Some of the resilient fault-tolerant techniques proposed in the literature are presented in Table 4.

Table 4 Enlightenment of resilient fault-tolerant techniques

In general, the reactive strategy does not require any preventive mechanism in the system until a fault occurs; efforts are made to moderate the harmful effects after faults are detected [74]. In a proactive strategy, the system is continuously tracked to analyze faults and eliminate them before they appear; the system state is continuously monitored to predict fault occurrence in advance. In resilient strategies, the system operates even in the presence of faults, and the faults are removed within a given timeframe. The respective pros and cons of these strategies are presented in Tables 5 and 6.

Table 5 Pros of fault-tolerant strategies
Table 6 Cons of fault-tolerant strategies

1.7.4 General problem formulation for fault tolerance using replication

Problem Statement: A problem formulation that focuses on the importance of fault tolerance in the context of the cloud.

Problem Scope: Fault tolerance in the cloud is addressed to provide continuous service delivery even in the event of failures or breakdowns.

Objectives: The main goal is to reduce fault-related service interruptions and downtime in order to maximize cloud service availability. Additionally, increasing resource utilization, minimizing data loss, and maintaining SLA thresholds are also included in the formulation.

Problem Constraints: The required efficiency of services must be guaranteed, the fault tolerance techniques should add as little overhead as possible, and the solution should be applicable to the related computational resources.

Parameters: The parameters manipulated during fault tolerance are MTTF (Mean Time To Failure), MTBF (Mean Time Between Failures), MTTR (Mean Time To Repair), etc. The parameters that are optimized are the average resource utilization, makespan, recovery rate, failure rate, success rate, etc. There can also be decision parameters in fault tolerance, such as the selection of alternative resources, the fault detection algorithm, the recovery mechanism, etc.
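
For reference, these timing parameters are connected by the standard reliability-engineering relations below (general textbook identities, not specific to any surveyed work), which also give the steady-state availability that many of the optimized parameters ultimately depend on:

$$MTBF=MTTF+MTTR,\qquad Availability=\frac{MTTF}{MTTF+MTTR}$$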

Problem Formulation: For fault tolerance in real-time systems, two important sets can be considered, i.e., the task set (T) and the VM set (V). T = {t1, t2, …, tn} denotes the n real-time tasks present at any instant in the cloud environment. Each real-time task {ti | ti ∈ T} has a set of attributes associated with it, such as arrival time, dimensions, expected execution time, anticipated finish time, anticipated harvest time, deadline limit, etc. The deadline and harvest time can be related to each other as follows:

$$Exp\; HT=D-Min\; PT$$

where D denotes the task deadline and Min PT the minimum processing time of the task.

V = {v1, v2, …, vm} denotes the m accessible VMs in the cloud environment.

Each accessible VM {vi | vi ∈ V} has a set of attributes associated with it, such as vm_id, capacity, cluster, etc.

Fault tolerance can be achieved by using any of the fault-tolerant approaches; here, we utilize the replication technique. The scheduler should therefore possess the capability to generate the required number of replicas separately for every real-time task.

For each {ti | ti ∈ T}

Enable the scheduler to generate replicas

Allocate a VM to each replica,

Calculate the expected finish time Fi,j,k of a given replica using the following equation:

$${\text{F}}_{\text{i},\text{j},\text{k}}=\text{A}\left({\text{t}}_{\text{i}}\right)+\text{w}\left({\text{r}}_{\text{i}}\right)+\text{e}\left({\text{r}}_{\text{i},\text{j},\text{k}}\right)$$

where i, j, and k represent the index of the original real-time task, the index of the current replica, and the index of the allotted VM, respectively. A is the arrival time of the real-time task, w is the waiting time of the replica, and e is the expected execution time of the replica on the allotted VM.

Further, e(ri,j,k) is computed by the following equation:

$$e\left({r}_{i,j,k}\right)=\frac{task\;dimensions}{computational\;power\;of\;allotted\;VM}$$

After e(ri,j,k) elapses, the following condition is evaluated for every real-time task.

If replica(ti) = failed

Mark ti “failed”

Else Mark ti “Succeeded”

Additionally, a reservation mechanism can also be used to achieve fault tolerance, in which a VM is reserved in advance and allocated in case of a fault. A minimal code sketch of the replication-based formulation above is given below.
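
The following Python sketch is a minimal, illustrative implementation of the replication-based formulation above. The replica count, the attribute names, and the randomized failure simulation are assumptions introduced for illustration; they are not part of the formal model.

```python
import random
from dataclasses import dataclass

@dataclass
class VM:
    vm_id: int
    capacity: float                  # computational power (e.g., MIPS)
    ready_time: float = 0.0          # when the VM becomes free

@dataclass
class Task:
    task_id: int
    arrival: float                   # A(t_i)
    dimensions: float                # task size (e.g., MI)
    deadline: float

def expected_finish(task, vm):
    """F_{i,j,k} = A(t_i) + w(r_i) + e(r_{i,j,k})."""
    wait = max(0.0, vm.ready_time - task.arrival)         # w(r_i)
    exec_time = task.dimensions / vm.capacity              # e(r_{i,j,k})
    return task.arrival + wait + exec_time

def schedule_with_replication(tasks, vms, replicas=2, fail_prob=0.1):
    """Run each task on `replicas` distinct VMs; a task succeeds if any replica
    finishes before the deadline without a (simulated) fault."""
    results = {}
    for t in tasks:
        # choose the replicas with the earliest expected finish times
        chosen = sorted(vms, key=lambda v: expected_finish(t, v))[:replicas]
        succeeded = False
        for vm in chosen:
            finish = expected_finish(t, vm)
            vm.ready_time = finish
            replica_failed = random.random() < fail_prob   # simulated fault
            if not replica_failed and finish <= t.deadline:
                succeeded = True
        results[t.task_id] = "Succeeded" if succeeded else "Failed"
    return results

if __name__ == "__main__":
    vms = [VM(0, 1000.0), VM(1, 500.0), VM(2, 2000.0)]
    tasks = [Task(0, 0.0, 4000.0, 8.0), Task(1, 1.0, 1000.0, 3.0)]
    print(schedule_with_replication(tasks, vms))
```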

Estimation Metrics: This comprises the estimation of optimization parameters such as recovery time, achieved reliability, and effectiveness of resource use under both faulty and normal operating conditions.

2 Related literature

The advancement of cloud computing technology has transformed the way computing resources are provisioned, utilized, and managed. Cloud computing offers a vast array of services that are flexible, scalable, and cost-effective. To improve the utilization of cloud resources, various dynamic resource allocation algorithms have been proposed in the literature. However, ensuring fault-tolerant scheduling and load balancing is a critical challenge that needs to be addressed to provide uninterrupted services in the cloud. Virtual machine reservation is one of the promising approaches that can mitigate these challenges by allocating reserved resources for fault tolerance and load balancing.

2.1 Scheduling with fault-tolerance

Efficient scheduling in the cloud provides optimization of various Quality of Service parameters, especially task completion time. Besides, scalability, availability, security, and fault tolerance are key features of cloud services. Instead of causing a complete breakdown of the system, faults in the cloud may lead only to performance degradation. Without fault-tolerant scheduling, when one or more components of the system fail, the task execution time, waiting time, response time, etc., may increase, and throughput degrades as well. Fault tolerance, however, provides an alternative way to complete the process even if some of the resources are not working properly [46, 64]. A few works in the literature have proposed fault-tolerant scheduling algorithms with optimized parameters. Recently, in [75], the Dynamic Clustering Cuckoo Whale Optimization Algorithm (DCCWOA) was suggested to support effective fault-tolerant scheduling in the cloud. The algorithm was tested by varying the tasks between 100 and 1000 with 8 virtual machines. The problem of fault tolerance was also investigated in [76], and a greedy-based best fit decreasing (GBFD) algorithm was proposed to increase the success rate of task execution along with the optimization of other parameters. The model was evaluated with various loads from the PUMA datasets. Additionally, the computational complexity was claimed to be O(nm), where n is the number of VMs in the data center and m represents the computing nodes. In [77], the authors proposed GWO (Grey Wolf Optimization)-based task scheduling, evaluated on a 1000 MI task dataset. Fault handling is carried out in the proposed work through efficient task scheduling by employing the task resubmission technique. Extending this chain of work and addressing the problem of dependability relationships, learning automata were used and a self-adapting scheduling strategy, namely ADATSA, was proposed in [78]. The model was experimentally evaluated on 53 servers with 3 master nodes and 50 slaves. The complexity was reported to be O(NK) + O(MS), where N represents the cluster nodes, K the resource categories, M the average number of tasks on a node, and S the average number of state transitions. In [79], a Fault-Tolerant Hybrid Resource Allocation Model (FTHRM) was recommended, which ensures fault tolerance and minimizes the turnaround time (TAT). The proposed model employs a prior reservation process to distribute resources to the respective tasks, ensuring guaranteed task execution. Resource reservation is also enabled for time slots, with resources organized as required by the task set while accommodating VM heterogeneity. In case of resource failure, alternative resources are supplied, where the most preferred resource is the one with the least prior workload and the smallest execution time. The authors in [80] presented a framework for adaptive scheduling and fragmentation of tasks, namely Workflow Scheduling with Adaptable and Dynamic Fragmentation (WSADF), which first creates fragments according to the number of VMs in the fragmentation phase, and then, in the scheduling phase, selects the VMs so as to reduce bandwidth usage. WSADF was evaluated on workloads ranging from 25 to 1000 tasks and VMs ranging from 5 to 25. To make task scheduling adaptable to both heterogeneous and homogeneous environments, CPSO and FIPS were proposed in [81]. The proposed task scheduling was evaluated on 30 servers under 1000 iterations.
Continuing this line of work, to integrate localized edge clouds with publicly accessible clouds and enhance scheduling effectiveness and scalability, a hierarchy-based edge cloud concept was introduced in [82]. Additionally, FTDS, a failure rescue technique, was suggested to address the failures that arise while mobile applications are being executed. For evaluation, the number of workflow applications was varied from 10 to 70 and the workflow length from 10 to 60. Besides, some SLA (Service Level Agreement) parameters, such as CPU requirement, system bandwidth, and memory, need to be considered alongside appropriate scheduling. In this regard, a pre-emption-based algorithm was proposed in [83], which pre-empts resources from low-priority tasks for high-priority tasks when resources are unavailable and provides resource reservation reflecting numerous SLA parameters for service deployment. The evaluations were carried out via 4 cloud simulations, performing 10 consecutive runs with 60 requests having 10 to 15 subtasks each. The cost and deadline of the tasks are used to define task priority. Moreover, it provides dynamic resource provisioning and an effective fault tolerance process. In the same vein, a fault-tolerance-aware task scheduling scheme, namely the Checkpointed League Championship Algorithm (CPLCA), was proposed in [84]. This algorithm provides fault tolerance using a checkpointing strategy along with task migration and was evaluated using workloads in the Standard Workload Format accessible via the San Diego Supercomputer Center (SDSC). Efficient scheduling and fault handling together can ensure task execution and thereby satisfy the real-time requirements of the cloud. However, heterogeneous systems and their complexities are increasing dramatically, leading to failures. These failures can be mitigated by implementing efficient scheduling approaches. Therefore, the task scheduling problem on heterogeneous systems was addressed in [85]. Since this is an NP-hard problem, a heuristic algorithm, the Deadline Based Scheduling Algorithm (DBSA), was proposed to resolve it. The DBSA approach dynamically estimates the number of permanent failures that can be tolerated by first calculating the makespan for a fixed number of tolerated failures and then repeatedly comparing the makespan with the specified deadline to obtain the next tolerable number of failures. The model was evaluated on workloads ranging from 20 to 100 tasks with 4 and 8 VMs. Gaussian Elimination, Fast Fourier Transformation, and Molecular Dynamics Code were used as application graphs for testing. Finally, each task is mapped to the appropriate processor without violating precedence constraints. Further, in [86], the cost-effective NNCA_PSO was proposed by modifying Particle Swarm Optimization (PSO). During the evaluations, the workload was varied from 70 to 560 tasks and 4 to 8 VMs were used. Furthermore, the Advance Reservation Fault Tolerance Model (ARFTM) was proposed in [87], which maps the tasks using MCT and tolerates faults using the advance reservation technique. ARFTM was evaluated by varying the workload from 1 to 300. A minimal, generic sketch of the task-resubmission style of fault handling mentioned above is given below.
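
The sketch below illustrates resubmission-style fault handling in general terms: a failed task is simply rescheduled on an alternative VM until it succeeds or a retry budget is exhausted. The retry limit, the randomized failure simulation, and the VM selection rule are assumptions for illustration and are not taken from any surveyed algorithm.

```python
import random

def execute_on_vm(task_id, vm_id, fail_prob=0.2):
    """Simulated task execution; failure is randomized purely for illustration."""
    return random.random() >= fail_prob

def resubmission_scheduler(tasks, vms, max_retries=3):
    """Resubmit a failed task to an alternative VM until it succeeds or the
    retry budget is exhausted."""
    status = {}
    for task_id in tasks:
        attempt, done = 0, False
        while attempt <= max_retries and not done:
            vm_id = vms[(task_id + attempt) % len(vms)]   # naive alternative-VM choice
            if execute_on_vm(task_id, vm_id):
                status[task_id] = f"succeeded on {vm_id} (attempt {attempt + 1})"
                done = True
            attempt += 1
        if not done:
            status[task_id] = "failed after all retries"
    return status

if __name__ == "__main__":
    print(resubmission_scheduler(tasks=range(5), vms=["vm0", "vm1", "vm2"]))
```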

In [88], the facts that network bandwidth is limited and that scheduling policies should reduce bandwidth usage in cloud computing were considered. The authors propose a data-locality-based task scheduling approach, i.e., the Balance Reduce Algorithm (BAR). It reduces network access and thereby bandwidth usage and job completion time, although the type and nature of the workload used for evaluation were not specified. Furthermore, an improved Balance Reduce Algorithm was proposed with better handling of machine failures. Later, in [41], a fault-tolerance-based scheduling approach, namely the Dynamic Clustering League Championship Algorithm (DCLCA), was proposed to reduce premature task failures. The model was evaluated in two scenarios: in the first, a parallel workload archive containing 73,496 tasks in the Standard Workload Format, accessible via the San Diego Supercomputer Center (SDSC), was used; in the second, workloads were generated from CloudSim's PlanetLab workload. All the surveyed methods are summarized in Table 7.

Table 7 Comparative analysis of recent scheduling-based fault tolerance algorithms

Scheduling and fault tolerance frameworks

Various scheduling and fault tolerance frameworks have been proposed in the literature. In this section, these frameworks are surveyed and presented. Comparative analyses of the different scheduling and fault tolerance frameworks are presented in Table 8.

Table 8 Comparative analysis of various Fault tolerance and scheduling frameworks

2.1.1 Proactive-based scheduling and fault tolerance framework

In this approach, the system can handle any disruptions or interruptions. The state of the system is monitored continuously for breakdowns and failure. Some of the proactive-based scheduling and fault tolerance frameworks found in the literature are mentioned below:

  • SHelp [91]: This approach was proposed as an improvement of the existing ASSURE framework [100], which operates at rescue points. ASSURE searches for a rescue point by traversing the rescue-trace graph, while in SHelp each rescue point is assigned a zero-initialized weight. The weight of a rescue point increases proportionally each time that rescue point is applied. Whenever a fault occurs, the rescue points are searched in decreasing order of their weights (a simplified sketch of this weighted selection is given after this list).

  • PFHC [92]: This is a proactive fault tolerance framework proposed for HPC (High-Performance Computing) applications in the cloud. The framework works on three chief modules. The Node Monitoring Module is equipped with special lm-sensors [101, 102] to periodically monitor several health parameters such as fan speed, CPU temperature, etc. The Fault Tolerance Module comes into action when faults occur; it is responsible for providing the resource provider with the information needed to obtain an alternative resource and for migrating the requests to the new resource. The Controller Module is installed at every node; it implements the fault-tolerant policy and is also responsible for the real-time migration of VMs.

  • WSRC [93]: This framework contains a module, namely a failure detector, that periodically checks the Virtual Machine Manager (VMM) for any kind of variation, such as a delay in response time or mismanagement of memory. If any fluctuation is found, the running status of the VMs is saved and the VMM is repaired using the rejuvenation technique. Rejuvenation generally leads to high overheads; however, WSRC uses variable-time rejuvenation to keep these overheads under control.

  • SRFSC [94]: The software rejuvenation technique is used in this framework, which primarily works in three phases. In the first phase, a packet containing information about the CPU and memory usage of the VM is received by the Aging Failure Detection component. The next step, known as Aging Degree Evaluation, assesses how close the VM is to failure based on two main aspects: CPU/memory usage and packet arrival, i.e., whether the packet arrived before/after expectations or has been lost. In the third and final step, the rejuvenation decision is taken: the VMs are migrated to another native VM and the original VM is rebooted.

  • FTDG [95]: FTDG is a fault-tolerant framework in which pre-emptive relocation is performed. The architecture of this framework mainly comprises four functional spaces. The User Space is used by users to submit their data flows. The Graph Space transforms the submitted user data into Directed Acyclic Graphs (DAGs); the DAG is then analyzed for critical and non-critical paths. In the Storm Space, scheduling and fault tolerance mechanisms are applied. The Hardware Space contains the various data center resources.
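
As an illustration of the weighted rescue-point selection described for SHelp above, the following simplified, hypothetical sketch keeps a per-rescue-point counter and tries the most frequently useful points first; the data structures and names are assumptions and are not taken from the SHelp implementation.

```python
# Simplified sketch of SHelp-style weighted rescue-point selection.
# Each rescue point starts with weight 0; a point's weight grows each time it
# is applied, so frequently useful rescue points are tried first.

class RescuePointTable:
    def __init__(self, rescue_points):
        self.weights = {rp: 0 for rp in rescue_points}   # zero-initialized weights

    def candidates(self):
        """Return rescue points in decreasing order of weight."""
        return sorted(self.weights, key=self.weights.get, reverse=True)

    def record_applied(self, rescue_point):
        """Increase the weight of a rescue point that handled a fault."""
        self.weights[rescue_point] += 1

table = RescuePointTable(["rp_parse", "rp_io", "rp_net"])
table.record_applied("rp_io")
table.record_applied("rp_io")
table.record_applied("rp_net")
print(table.candidates())   # ['rp_io', 'rp_net', 'rp_parse']
```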

2.1.2 Reactive-based scheduling and fault tolerance frameworks

In such frameworks, the faults are handled once they occur. Unlike proactive approaches, monitoring of system behavior is not required in such frameworks. Some of the Reactive-based scheduling and fault tolerance frameworks found in the literature are mentioned below.

  • AFTRC [96]: In AFTRC (Adaptive Fault Tolerance in Real-time Cloud Computing), the received tasks are held in an input buffer and task execution is accomplished on a first-come, first-served basis. This model also consists of several other modules. The Acceptance Test (AT) module checks the results of each embedded algorithm for accuracy and verifies the results. The Time Checker (TC) checks whether the result is obtained within the deadline or not; if not, the specific task is sent back to the input buffer. The Reliability Assessor (RA) adjusts the reliabilities of the VMs based on the obtained results. The Decision Mechanism (DM) selects the output from the most reliable node.

  • BFTCloud [97]: This framework uses replication techniques and completes user requests on time even in the presence of faults. The number of replicas/nodes is determined by considering the failure probability of all nodes; the failure likelihood of the replica group should always remain below the top-level failure likelihood. The BFTCloud framework mainly works in five phases. Primary Selection: in this phase, the primary node is designated based on a rating obtained by adding the priority weight and the QoS value assigned to each node; the node with the highest rating is chosen as the primary node. Replica Selection: in this phase, the replicas are selected by observing the QoS of every node from the viewpoint of both the primary node and the cloud module; the new QoS is calculated, and the rating is performed again. Request Execution: this phase allows the nodes to complete the request and respond to the cloud module accordingly. The cloud, in turn, checks the consistency of the obtained results based on different cases [17]; if the results are consistent, then the primary replica is assigned to the next request. Primary Updating: in case of a fault in the primary replica, this phase informs all other replicas to select an alternative. Replica Updating: this phase removes the faulty replica and adds new nodes to decrease the failure probability.

  • FESTAL [98]: FESTAL is a fault-tolerant scheduling framework in which the primary-backup technique is realized to handle faults. In this framework, the user tasks are queued in an input buffer and assigned to the scheduler, which has three controllers, i.e., Resource Controller, Backup Copy Controller, and Real-time Controller.

The Backup Copy Controller creates the backup copy. Afterward, the Resource Controller searches for two VMs that can complete the task within the deadline. Based on the search results, two decisions can be made.

  • In case the two corresponding VMs are found, both task instances are scheduled on the respective VMs.

  • In case no such VM is found, the task is rejected.

In this framework, if the anticipated end time is less than or equal to the task deadline, a passive backup is utilized; otherwise, an active backup is employed, as sketched below.
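
A minimal sketch of this decision rule follows, under the assumption that the "anticipated end time" refers to the finish time of a backup that starts only after a primary failure; the numbers are illustrative.

```python
def choose_backup_mode(deadline: float, lazy_backup_finish: float) -> str:
    """Decision sketch following the rule stated above: if a backup that only
    starts after a primary failure (passive) can still finish by the deadline,
    use it; otherwise run the backup actively alongside the primary."""
    return "passive backup" if lazy_backup_finish <= deadline else "active backup"

print(choose_backup_mode(deadline=10.0, lazy_backup_finish=9.0))   # passive backup
print(choose_backup_mode(deadline=10.0, lazy_backup_finish=12.0))  # active backup
```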

2.1.3 Resilient-based scheduling and fault tolerance frameworks

These techniques have some similarities to the proactive approach: defects are forecasted, and their effects are prevented or mitigated by applying suitable methodologies. However, the forecasting relies on intelligent learning, which is what distinguishes resilient techniques from proactive ones. Compared to conventional fault tolerance techniques, resilient fault tolerance provides increased durability and adaptability in the event of system breakdowns. Some of the advantages of resilient fault tolerance over traditional fault tolerance are:

  • Dynamic environment: Resilient systems can bounce back from errors without sacrificing functionality because they dynamically adjust to shifting circumstances; they are designed to respond quickly to changing threats and difficulties. Conventional fault tolerance techniques, by contrast, may find it difficult to adjust to sudden or rapid shifts in the environment, and since they frequently rely on predetermined rules, they may not react as well to new kinds of errors.

  • Recovery: Resilient systems often include automated recovery mechanisms capable of promptly detecting and fixing errors without human interaction, which reduces the effect on coordinated functions and decreases downtime. Traditional approaches, on the other hand, may require more manual intervention to recover from errors, resulting in longer recovery times and a higher chance of service interruption.

  • Real-time tracking and reporting: Sophisticated analytics and tracking techniques, which offer practical observations of the system's performance, are frequently integrated into resilient systems; they also enable active defect identification and prevention. Conventional approaches, in contrast, may be less successful at locating and addressing errors because they depend on periodic checks or event-triggered reactions.

  • Optimization: Resilient systems maximize the use of the available resources during fault recovery, guaranteeing that resources are distributed effectively to sustain critical operations. Traditional techniques, by comparison, may rely on expensive strategies, which can result in more inefficiency and lower overall system effectiveness.

  • Flexibility and adaptability: Resilient designs frequently display improved adaptability and flexibility, enabling them to adjust to changing demands and scale resources up or down in response to consumption. Traditional approaches may struggle to adjust dynamically or to regulate shifting demands, which can result in inefficiencies during times of high consumption.

To sum up, resilient fault tolerance provides a more systematic and flexible approach to addressing failures, allowing systems to recover swiftly and efficiently in dynamic contexts. Compared with conventional fault tolerance techniques, this strategy frequently results in increased overall performance, fewer interruptions, and enhanced system efficiency. In this context, EFTT (Efficient Fault Tolerance Technique) is a type of resilient-based approach. In [99], the authors used machine learning to handle faults and generate fault tolerance solutions; ML was, nevertheless, applied only as a sub-component of the overall FT solution. Some solutions have employed ML intensively to forecast faults from a set of specified variables, and many applications have combined ML with the handling of hardware faults. Here, artificial intelligence, or machine learning, is used to create a system that can operate autonomously, like a human, without the need for human intervention. Machine learning procedures can be used to increase a system's reliability even in the case of fault tolerance; such fault tolerance techniques are known as Resilient Fault-Tolerant Techniques, as discussed in Section 5.3. Machine learning techniques are typically used in proactive roles, predicting failures before they happen from historical system data (sketched after this paragraph). Resilient fault tolerance techniques are the emerging ones because ML can access data and even learn from it. One such learning method, reinforcement learning, was applied in [103] to study the fitness of VMs in cloud environments; with this type of learning, every VM participates in the learning process independently. As recommended in [104], fault tolerance in a distributed or parallel learning system is achieved by constantly tracking the input parameters in the server; the entire system returns to the most recent checkpoint following an error, and such systems do not perform checkpoints at every stage, even in the presence of a high number of calls and network activity. Fault forecasting is well known in fault identification and handling, as stated in [105]; quick error detection can prevent more serious system failures. This operation comprises numerous processes, and some of the most recent research investigations include quantitative model-based, qualitative model-based, and history-based approaches. Apart from reinforcement learning, unsupervised learning is an additional technique for recognizing patterns in data without a predefined output [106]. Such techniques do not allow the outcome to be estimated, since unsupervised learning lacks an output target; instead, the algorithms rely on their own structure to extract as much detail as they can from the data. Deep learning techniques were proposed in [107] as a rapid way to identify multicriteria errors in complex industrial analyses. Fault tolerance can benefit from the application of such AI-related techniques.
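
To make the proactive/resilient use of ML concrete, the sketch below trains a simple classifier on synthetic historical VM metrics and flags a VM whose predicted failure probability exceeds a threshold; the features, data, and 0.7 threshold are illustrative assumptions rather than any of the cited methods.

```python
# A minimal sketch (synthetic data, illustrative features) of ML-based failure
# prediction: learn from historical VM metrics whether a failure is likely, and
# act (e.g., migrate or checkpoint) before it happens.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Historical samples: [cpu_utilisation, memory_utilisation, io_error_rate]
healthy = rng.normal([0.4, 0.5, 0.01], 0.05, size=(200, 3))
failing = rng.normal([0.9, 0.9, 0.20], 0.05, size=(200, 3))
X = np.vstack([healthy, failing])
y = np.array([0] * 200 + [1] * 200)          # 1 = VM failed shortly afterwards

model = LogisticRegression().fit(X, y)

# Live reading from a VM that looks overloaded.
current = np.array([[0.88, 0.92, 0.15]])
p_fail = model.predict_proba(current)[0, 1]
if p_fail > 0.7:                              # illustrative threshold
    print(f"Predicted failure probability {p_fail:.2f}: migrate/checkpoint the VM")
else:
    print(f"Predicted failure probability {p_fail:.2f}: keep monitoring")
```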

Fault Induction: In this resilient technique, failures are managed by making assumptions based on the reaction of the system. Moving forward with this technique, [108] proposed that a hybrid energy system be used in practice to apply a multi-source power administration technique; the analysis shows how to improve fault tolerance, scalability, efficiency, and dependability. The concepts proposed in [99] are being used by some of the most well-known firms in the world, including Google and Amazon, to increase their fault tolerance. Here the authors employed software named GameDay, which is intended to highlight significant shortcomings in methods for finding flaws and dependencies between different components of a system. In a GameDay scenario, team members from every level of the business must collaborate to find a solution; if everything goes perfectly in the repeatable tests, the GameDay activity is considered successful. Similarly, the authors in [109] employed game theory and showed that such a smart-grid operator can swiftly supply electricity through a distributed system. Additionally, several classifiers have been compared in [110] on metrics such as accuracy and fault prediction.
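
The toy drill below conveys the spirit of such fault-induction exercises: replicas are deliberately taken offline and the system is checked for continued service. It is an illustration only, not the GameDay tooling itself.

```python
import random

class ReplicatedService:
    """A toy replicated service used for a fault-induction drill."""

    def __init__(self, n_replicas: int):
        self.up = {f"replica-{i}": True for i in range(n_replicas)}

    def inject_fault(self) -> str:
        # Deliberately take one healthy replica offline.
        victim = random.choice([r for r, ok in self.up.items() if ok])
        self.up[victim] = False
        return victim

    def handle_request(self):
        # The service survives as long as at least one replica is still up.
        survivors = [r for r, ok in self.up.items() if ok]
        return survivors[0] if survivors else None

service = ReplicatedService(n_replicas=3)
for _ in range(2):                                  # induce two faults in a row
    print("Injected fault into", service.inject_fault())
server = service.handle_request()
if server:
    print("Drill passed, request served by", server)
else:
    print("Drill failed: outage")
```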

2.2 Load balancing with fault tolerance

Load balancing with fault tolerance is a significant challenge in cloud computing. Efficient load balancing techniques must also include fault tolerance capacity: the system should distribute the load uniformly across all available nodes while simultaneously detecting and removing faults to maximize the performance and efficiency of the cloud environment. Various algorithms are surveyed and presented in Table 9. The authors of [111] introduced Honeybee Inspired-Load Balancing (HBI-LB), a reliable, nature-inspired, fault-tolerant load-balancing approach. In the suggested method, the assigned tasks ranged from 100 to 500 in number and from 2000 to 10000 in length, and 10 fog centers and 15 fog nodes were utilized. Assigned tasks pass information about the status of and load on the resources to other in-progress tasks, in the same way that honeybees inform their fellows about their position. Besides, in [112], the Proactive and Reactive Fault Tolerance Framework (PFTF) was proposed with an Elastic Cloud Balancer (ECB). It avoids the situation in which some cloud nodes are idle or minimally loaded while others are overloaded. The proposed ECB enhances scheduling quality in combination with Job Shop Scheduling by considering and optimizing QoS parameters; the model was evaluated with 9 to 13 tasks of sizes ranging from 1000 MB to 8000 MB. Additionally, due to the dynamic nature of cloud infrastructure, real-time features such as availability and reliability need to be achieved. In this line, Proactive Load Balancing Fault Tolerance (PLBFT) was proposed in [113] as an efficient fault-tolerant load-balancing model evaluated on a private cloud platform. This model relies on migrating the faulty VM directly to another destination host; the load on the destination VM is managed first (in case the destination VM is overloaded) before the defective VM is migrated there (a sketch of this idea follows this paragraph), and the approach shows high reliability compared to other similar techniques. Load balancing and fault tolerance techniques are designed to provide highly reliable and available services, and to further improve the availability of cloud services, a combination of load-balancing and fault-tolerant techniques has been proposed [114]. The proposed model is highly reliable in case of task failure when the number of tasks is between 13 and 18, the task execution time is between 1 and 9, and the task priority is between 1 and 3, with four VMs. Moreover, in [115], Deadline Pre-emptive Scheduling (DBPS) was proposed based on cloud partitioning, where fault tolerance is achieved by Throttled Load Balancing for Cloud (TLBC); the model was tested on a workload of 10 to 300 tasks without specifying the number of VMs. Furthermore, a machine-learning-based approach, Fault-tolerance Load Balancing (FTLB), was proposed in [99]; it embeds fault tolerance in load balancing while optimizing other QoS parameters, and the evaluation was performed using 100 computing cycles on three VMs. Finally, an Integrated Virtualized Failover Strategy (IVFS), similar to AFTRC, was proposed in [116]. It employs replication and checkpoint-restart: a Cloud Load Balancer (CLB) was added to AFTRC, and checkpointing was carried out by implementing the Reward Renewal Process (RRP) [117]. Once the load is received, it is transferred to the CLB by the Cloud Controller (CC); the main job of the CLB is to replicate the load on a suitable VM, based on the load information, in case of failure.
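
The PLBFT idea of managing the destination's load before migrating a faulty VM can be sketched as follows; the host names, capacities, loads, and the 0.8 overload threshold are made-up values, not the evaluated model.

```python
def migrate_faulty_vm(faulty_load, hosts, overload_threshold=0.8):
    """Pick the least-loaded host; if adding the faulty VM would overload it,
    shed enough load first (where the shed load goes is omitted in this toy)."""
    dest = min(hosts, key=lambda h: h["load"] / h["capacity"])
    projected = (dest["load"] + faulty_load) / dest["capacity"]
    if projected > overload_threshold:
        excess = (projected - overload_threshold) * dest["capacity"]
        dest["load"] -= excess
        print(f"Shed {excess:.1f} load units from {dest['name']} before migration")
    dest["load"] += faulty_load
    return dest["name"]

hosts = [
    {"name": "host-A", "capacity": 100.0, "load": 70.0},
    {"name": "host-B", "capacity": 100.0, "load": 50.0},
]
print("Faulty VM migrated to", migrate_faulty_vm(faulty_load=40.0, hosts=hosts))
```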

Table 9 Comparative analysis of different proposed fault tolerance and load balancing algorithms

The comparative analysis of different fault-tolerance-based load-balancing algorithms is presented in Table 10. These algorithms were proposed to distribute the workload across the nodes regardless of faults, i.e., they have the capacity to handle faults.

Table 10 Comparative analysis of fault-tolerant-based load-balancing algorithms

3 Discussions and observations

The presented survey summarizes the focus of researchers on distinct hybrid fault-tolerance-related frameworks. The main emergent and developing methods of fault tolerance in a cloud environment fall into three categories: Reactive, Proactive, and Resilient Methods. The survey was conducted on two main hybrid fault-tolerant categories, i.e., scheduling with fault tolerance and load balancing with fault tolerance. Several observations gathered during the survey are listed below.

3.1 Statistics of hybrid survey of scheduling and fault tolerance algorithms

While dealing with hybrid frameworks of scheduling and fault tolerance, researchers have focused on all three fault tolerance approaches, i.e., Reactive, Proactive, and Resilient. However, it is observed that more emphasis has been placed on Proactive approaches and less on Resilient ones. The related statistics of these approaches are depicted in Fig. 9.

Fig. 9 Showing different fault-tolerance approaches targeted by researchers

Moreover, different techniques such as Replication, Migration, and Rejuvenation have also been employed while dealing with this hybrid framework. Replication techniques are mainly used for reactive approaches. On the other hand, Migration and Rejuvenation techniques are utilized for proactive approaches. It is also observed from the literature that replication and migration techniques were more frequently used to address the faults in the cloud. Moreover, self-healing and checkpoint restart techniques are used by the SHelp framework. The statistics of different approaches employed for Reactive, Proactive, and Resilient strategies in this hybrid framework are depicted in Fig. 10.

Fig. 10 Showing category-wise percentage of different techniques used in different fault-tolerant approaches

It is also noticed from the presented survey that different types of faults have been handled by using hybrid fault-tolerant scheduling and load-balancing frameworks. Moreover, it was observed that software faults, hardware faults, parametric faults, and crashes were resolved using a proactive approach. The reactive approach addressed configuration faults, parametric faults, byzantine faults, participant faults, and host failures. Likewise, resilient approaches are utilized to manage general faults. Additionally, the overall statistics of different faults handled by considered hybrid frameworks are depicted in Fig. 11.

Fig. 11 Showing the percentage of different faults handled in the surveyed scheduling and fault tolerance frameworks

The statistics of the fault models considered in the surveyed articles show that researchers are more motivated towards software faults, while transient, intermittent, and permanent faults have received less attention. For several strong reasons, addressing these kinds of faults is essential in distributed systems and applications. First, proactive steps to guarantee system resilience are required due to the unpredictable nature of transient faults, which are brief interruptions in system performance; to reduce downtime and provide a consistent user experience, organizations must recognize and address transient issues. Second, intermittent failures, defined by irregular disruptions that might happen at any time, pose a major threat to system reliability; they must be managed effectively to avoid cascading failures and to guarantee the stability of necessary executions, thereby preserving the system's overall integrity. Furthermore, the seriousness of permanent faults cannot be overstated: these enduring problems may cause the system to deteriorate over time, impacting system operation and SLAs. Resolving permanent faults is therefore essential for maintaining the system's lifespan and functionality, while ignoring them might cause irreversible harm and compromise the overall sustainability of the system. Finally, the maintenance of system continuity, robustness, and reliability is the primary reason for managing the discussed hardware failures. In the end, proactive fault management techniques contribute to uninterrupted system and application performance during unexpected obstacles by protecting the integrity of crucial operations, improving SLAs, and thereby enhancing user experience and satisfaction.
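
As an illustration (not drawn from the surveyed works), the sketch below pairs each of these fault classes with a typical handling policy: retry for transient faults, monitoring and isolation for intermittent ones, and replacement for permanent ones.

```python
def handle_fault(fault_type: str) -> str:
    """Map a fault class to an illustrative handling policy."""
    policy = {
        "transient": "retry the operation after a short back-off",
        "intermittent": "flag the resource, increase monitoring, reroute if it recurs",
        "permanent": "take the resource out of service and migrate its workload",
    }
    return policy.get(fault_type, "escalate to the operator")

for fault in ("transient", "intermittent", "permanent", "unknown"):
    print(f"{fault:12s} -> {handle_fault(fault)}")
```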

3.2 Statistics of hybrid survey of load balancing and fault tolerance algorithms

It is also perceived in this survey that researchers have focused on the optimization of various parameters simultaneously along with fault tolerance. Response time was considered and optimized more frequently than other QoS parameters, while the least consideration was given to task waiting time and computational cost. Based on this survey, the statistics of the various optimized parameters are presented in Fig. 12. Besides, the considered frameworks include both dynamic and static environments, and researchers are more motivated toward dynamic than static algorithms. Figure 13 depicts the statistics of the surveyed models that support dynamism.

Fig. 12 Showing the percentage of optimized parameters in surveyed load balancing and fault tolerance frameworks

Fig. 13 Showing the percentage of dynamism in surveyed hybrid load balancing and fault tolerance frameworks

An analysis was also carried out of the parameter optimization required for a reliable cloud. Figure 14 presents the degree of optimization of metrics for scheduling with fault tolerance, scheduling with load balancing, fault tolerance, load balancing, and scheduling. Additionally, a parameter optimization analysis of the various fault-tolerant approaches from the literature was conducted and is presented in Fig. 15.

Fig. 14 Showing the analyses of parameter optimizations for different cloud reliability measures

Fig. 15 Showing the percentage of parameter optimizations for different fault-tolerant approaches

Finally, observations regarding the platform or environment used for simulation in the surveyed works are presented statistically in Fig. 16.

Fig. 16 Showing the percentage of tools used for simulation by the researchers

4 Forthcoming research directions and open issues

It can be seen from the reviewed state-of-the-art that important QoS parameters other than response time are not being focused on. Other parameters, such as makespan, turnaround time, waiting time, flowtime, resource utilization, and accuracy, also need to be considered. Furthermore, various other faults, such as byzantine faults and system crashes, have not been examined much in hybrid fault tolerance algorithms. Therefore, it is necessary to enhance the performance of these hybrid fault tolerance algorithms by addressing these limitations in forthcoming research. Moreover, researchers should focus on some of the aspects mentioned below to overcome the limitations of existing techniques [138, 139].

  • Focus more on resilient fault tolerance.

  • Focus on the computational cost along with fault tolerance.

  • Identify and predict the faults accurately.

  • Resolve faults with load balancing and scheduling.

  • Fault handling with optimization of other QoS parameters.

4.1 Future works

After careful consideration and assessment, it is concluded that several research directions could be pursued to raise the performance of cloud computing and boost the optimization of the QoS parameters of cloud systems. They are listed below:

  1. The researchers can make scheduling more efficient for better makespan and average resource utilization.

  2. The assessed state-of-the-art shows that, except for response time, certain crucial QoS criteria are not being prioritized. Additional factors, including turnaround time, waiting time, flow time, resource utilization, and accuracy, also need to be taken into account.

  3. To improve task execution time and scheduling, a large body of research focuses on discovering resource and workload identification criteria. For workloads to be adaptive, scalable, and optimal, under- and over-utilization of resources should be avoided.

  4. A sender-initiated load balancing mechanism that assists in uniform load distribution among dispersed nodes is necessary for task relocation.

  5. Reservation can be used for fault tolerance, as suggested in [87], to ensure complete execution of tasks: resources are reserved well in advance and may be used in case of faults.

  6. It is essential to concentrate on penalty limiting while taking system failures into account in order to attain QoS-optimisation-based allocation.

  7. Only a few scheduling methods include the availability feature, which depends heavily on VM failures and on changes in the user impact rate; therefore, to decrease VM failures, it is important to take this parameter into further consideration in later algorithms.

  8. The penalties arising from faults can be minimized by accompanying the models with efficient load-balancing techniques.

  9. It is clear from examining several methods that a task scheduling algorithm by itself cannot address all the issues. Most algorithms base their scheduling on a few factors; one method, for instance, only considers the response time and execution time parameters and overlooks other QoS principles such as execution cost, dependability, and utilization. Therefore, by including more criteria, an improved scheduling algorithm that can produce better results may be developed.

  10. Future studies should consider scalability, elasticity, and other fault-handling overheads, which are the properties that allow a system to fit a given situation.

4.2 Methodical roadmap for open challenges

A structured strategy or roadmap, presented in Fig. 17, that incorporates prioritization based on impact and feasibility is needed to address the challenges of scheduling and load balancing with fault tolerance.

Fig. 17 Showing the proposed structured roadmap to address the cloud challenges

5 Conclusion

In this study, diverse models for analyzing faults and rectifying them by implementing fault tolerance integrated with scheduling and load-balancing strategies in cloud environments are comprehensively surveyed. The main emergent and developing methods of fault tolerance in the cloud environment are categorized into Proactive, Reactive, and Resilient. Resilient approaches, which draw on the revolutionary AI/ML technologies, are observed to be more efficient than proactive and reactive techniques. This is because reactive and proactive techniques normally employ traditional procedures such as checkpoint-restart, replication, and migration, which have limitations: they may find it difficult to adjust dynamically or to accommodate shifting demands, which can result in inefficiencies during times of high consumption.

After reviewing the literature, the below-mentioned conclusions can be drawn:

  • Checkpointing/restarting and replication were found to be the most frequently used methods to address faults in the cloud.

  • Scholars and researchers are more concerned with detecting crash faults than hardware faults such as transient, intermittent, or permanent faults.

  • When it comes to the implementation tool for evaluating the presented algorithms, researchers mostly use the CloudSim tool.

  • Proactive approaches have been used more frequently than reactive and resilient ones.

  • Researchers are more motivated toward response time and less towards makespan, adaptability, accuracy, and crashes.

  • Since resilient approaches utilize machine learning and artificial intelligence to predict and handle faults, they represent the forthcoming direction of fault tolerance in the cloud.