1 Introduction

Over the last decade, the use of cloud computing has grown substantially. An increasing number of facilities are incorporated into the cloud environment and made accessible to users worldwide. Likewise, cloud computing companies such as IBM, Yahoo, Amazon, and Google provide global access to their services [1]. Moreover, these are metered services, commonly termed subscriptions, and are frequently offered through the Software as a Service (SaaS) delivery model [2].

The cloud environment consists of two components, i.e., the frontend and the backend. The frontend is the main interface on the consumer side and is accessed through different networks over the internet [3]. The backend belongs to the CSP (Cloud Service Provider) and provides services by utilizing data center resources. These data centers house physical machines known as servers. Multiple virtual copies of these physical machines can be created through virtualization. Virtualization enables the handling of multiple incoming requests for a particular application or service from across the globe. The shareable resources include applications, software, hardware, etc.

In cloud architecture, there are mainly three service models [4]: Infrastructure as a Service (IaaS), Software as a Service (SaaS), and Platform as a Service (PaaS) [5, 6]. Faults may arise in any of these three layers while user services are being delivered. Therefore, the detection and removal of faults is necessary for obtaining the best possible reliability, as presented in [7, 8]. Moreover, deficiencies in the cloud infrastructure have a direct impact on resource reliability and availability [4]. These deficiencies need to be critically analyzed and treated to improve reliability and robustness. The Deep Neural Network (DNN), a powerful deep learning tool, is a promising solution for this [9]. Fault tolerance is a significant capability that can detect, locate, and recover from faults and failures in the cloud environment, making the cloud more robust and enhancing its efficiency [10]. Fault tolerance mainly falls into two sub-areas, i.e., hardware fault tolerance and software fault tolerance [11].

On the other hand, scheduling tasks appropriately is vital in delivering critical and essential cloud services. Ineffective task scheduling increases task execution time and waiting time. Besides, inadequate load balancing results in under- and over-utilization of resources: under-utilization wastes resources, while over-utilization degrades the performance of cloud systems. Hence, proficient load distribution is essential to boost the performance of cloud-based applications.

There is a fundamental need to incorporate load balancing and scheduling into efficient fault-handling mechanisms due to the architectural challenges of cloud systems. Therefore, this paper conducts a hybrid review covering fault tolerance together with scheduling, load balancing, and the analysis of QoS (Quality of Service) parameter optimization. This comprehensive review primarily centers on three core classifications of fault tolerance techniques, namely reactive, proactive, and resilient approaches. The reactive procedures are the conventional fault tolerance techniques and include replication, detection, checkpointing/restarting, and recovery. In the proactive methods, the system is prevented from reaching a defective state through monitoring, prediction, and pre-emption; actions are taken to minimize the defects, and thereby the failure condition is avoided. The resilient methods have shown a recent take-off in the literature and indicate a potential trajectory for the future of fault tolerance in cloud environments, because these methods are grounded in artificial intelligence and machine learning (ML) [10]. Besides, simulation toolkits play a critical role in evaluating cloud computing settings. These toolkits allow us to simulate and evaluate cloud set-ups cost-efficiently without the need for massive infrastructure. Some of the most effective and powerful simulators are discussed in [12]. A comparative analysis among various simulators concerning various parameters was performed in [13] to determine the features and functions of each toolkit.

1.1 Research methodology and data analysis

This section describes the methods used to perform the qualitative assessment of the literature in this review and the sources of the considered state-of-the-art works. It also presents the methodology adopted for the proposed research. Finally, we specify our significant contributions in this review.

The selection and elimination of published articles were determined based on several criteria. Related articles were shortlisted after analyzing their abstracts, and a critical review/analysis was then performed. Papers were selected based on the quality of both the database and the article itself. Furthermore, inclusion was based on the following conditions.

  a. Searching strategy

    A systematic survey of fault tolerance with efficient scheduling and load distribution techniques proposed in the literature was conducted through well-known sources.

    Several search keywords were used in this study, including Cloud Resources, Fault Tolerance, Task Scheduling, Load Balancing, QoS Parameters, Resource Optimization, failure in the cloud, essential cloud services, cloud architecture, scheduling techniques, etc.

  b. Duration and validity of the study
    • This review mostly incorporates articles from 2009 to 2023 from well-reputed journals, books, and conferences.

    • The statistics of the considered publications from 2009 to 2023 are depicted in Fig. 1.

    • Very few studies are included from 2007 and 2008.

    • The selected duration was chosen to capture a comprehensive range of data, such as technical progress, economic cycles, and policy variations, and to confirm the availability of data pertinent to our study that reflects the evolution, progression, and trends applicable to our study objectives.

    Fig. 1. Percentage of the included papers (2009 to 2023)

  c. Language and selection/inclusion criteria
    • The language criterion was restricted to English, because English is the primary language of scholarly publications, particularly in the fields of computer science and distributed computing. Limiting the criteria to English-language articles ensured that we selected high-quality and broadly recognized studies, facilitating a thorough and appropriate review.

    • The primary priority was given to hybrid fault tolerance approaches including either scheduling or load balancing.

    • Hybrid fault tolerance approaches that also optimize other QoS parameters were considered. Figure 2 presents the detailed inclusion and exclusion of the studies.

    Fig. 2. Methodology of the inclusion and exclusion criteria of the studies

  d. Data processing and analysis
    • The data was initially organized into Excel and prepared for analysis.

    • Data categorization was made based on different QoS parameters, the environment used for simulation, types of faults considered, and other thematic considerations. This categorization helps us to analyze the literature more clearly and precisely.

    • The qualitative information was obtained by considering diverse Quality of Service (QoS) metrics, types of faults addressed, and the range of simulation environments utilized across a timeframe.

    • Furthermore, the analysis also highlights the various fault tolerance methods employed in the existing literature.

  e. Synthesis of the analysis
    • For meaningful conclusions and insights, the data was observed based on the objectives of the study.

    • The patterns and relationships among the various studies were discussed for comparison and assessment.

  f. Quality assessment and validation procedure

    The methodology adopted for this study can be summarized in four stages:

    • Initially, related articles were searched using the relevant keywords.

    • Some articles were selected based on title, standards, and optimization parameters.

    • The abstracts of the selected articles were examined, and further inclusion and exclusion were performed.

    • Finally, the included articles were extensively reviewed, analyzed, and incorporated into this survey.

1.2 Motivation

Faults can lead to malfunctions that worsen a system's overall performance. Failures result in the breakdown or shutdown of a system, but occasionally, flaws cause performance to decline rather than a complete shutdown. Various fault tolerance solutions can be employed to address different types of defects, such as network, physical, and process problems. However, fault tolerance is hard to achieve without comprehending where the issue exists inside the architecture and the damage that the system flaw produces. The cloud is composed of layers, each of which takes services from the layers below it. A failure at any layer has the potential to contaminate the layer directly above it, since faults at one layer may affect the services that the other layers provide. Thus, for high-performance computing, an appropriate fault tolerance system is needed to handle these faults effectively. Faults should be managed critically and dynamically to make the cloud environment more efficient and intelligent. Besides, in the cloud, efficient task scheduling leads to maximum utilization of virtual machines and reduced operational costs, thereby improving the QoS parameters and eventually the overall performance. Moreover, load-balancing techniques need to be examined comprehensively across various settings, including static, dynamic, and nature-inspired cloud environments.

Various methods have been suggested in the academic literature to address this concern, and multifarious reviews are available for future researchers. While studying the existing surveys, it was observed that they are not sufficiently thorough or wide-ranging in certain respects. Although the authors in [14] presented a comprehensive survey on fault tolerance, it does not focus on other aspects of the cloud such as efficient load balancing and scheduling. In [15], an extensive survey focused on scheduling but lacked emphasis on fault handling and load distribution. Besides, [16] presented a vast survey focusing on load balancing across cloud resources but lacking fault handling and cloud optimization. Similarly, [17] also provides a survey emphasizing fault tolerance frameworks; however, it does not address enhancing the performance of the cloud environment. The work in [18] considers only fault-tolerant approaches and does not give prominence to major cloud aspects such as scheduling and load balancing. Similarly, the most recent survey presented in [19] focused on both scheduling and fault tolerance but offered no means for optimal load distribution. Additionally, the observations presented in [20] were limited to a few aspects of fault-handling techniques, only crash and Byzantine fault models were considered, and QoS parameters were not taken into account. Similarly, the recent survey presented in [21] was found to be limited to reliability. In other words, these reviews did not significantly focus on the discussed issues of the cloud related to fault tolerance combined with scheduling/load balancing. After this comprehensive analysis, it was observed that none of the mentioned surveys offer extensive consideration of the above-mentioned scenarios of cloud computing. The QoS and other important aspects related to the cloud's fault tolerance concerns are addressed by researchers in the existing surveys, but only to a very limited extent. This renders the existing reviews inadequate for analyzing the current state of the art in cloud systems. Hence, there is a dire need for a survey focusing on the reliability-related aspects of the cloud, which motivated us to present this systematic and hybrid review. In this survey, we explore the landscape of hybrid fault tolerance models that focus not only on traditional fault tolerance techniques but also integrate other important cloud aspects such as scheduling and load balancing. This integration helps us to highlight the likely applications, challenges, and emerging trends.

Hybrid models that combine fault tolerance with load balancing and scheduling offer several advantages over single-scheme approaches.

  • Illustrative example:

  • Consider a scenario where a CSP hosts several services and applications for its clients, utilizing solely fault tolerance mechanisms (single-model schemes). Fault tolerance frequently results in redistributing the workload from faulty virtual machines (VMs) to the unaffected VMs. This redistribution often upsets the load equilibrium between VMs, which leads to an unequal workload distribution and a deterioration in overall service performance. However, if the CSP implements a hybrid model that integrates multiple reliability measures, it can enhance reliability and provide robust services to the clients. In our example, if the CSP employs a hybrid model that performs load balancing after the fault-tolerance measures, it can simultaneously minimize the risks of non-uniform load distribution and other overheads associated with fault tolerance, and improve the QoS (a toy code sketch of this sequence is given after this list).

  • Besides, to make this emerging domain more accessible for future researchers, there is a need to analyze up-to-date methods concerning these factors [10]. This review is also inspired by peer surveys of the existing literature along with their limitations. Moreover, it presents an analysis of some important aspects of the existing literature, such as QoS, static/dynamic behavior, the environmental setup used, fault tolerance approaches, and fault models, and presents the results in graphical form. The analysis provided offers a comprehensive perspective on the research efforts that have been the focal point of existing studies. An overall comparison of the top-cited surveys with the proposed survey is also illustrated in the subsequent sections.
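
To make the illustrative example above concrete, the following sketch shows a hybrid sequence in which fault tolerance (migrating work off a failed VM) is followed by a rebalancing pass. This is an illustrative toy model only; the load metric, threshold, and VM names are assumptions and not taken from any surveyed scheme.

```python
# Toy sketch of a hybrid fault-tolerance + load-balancing step.
# Loads are abstract "work units" per VM; values are purely illustrative.

def handle_vm_failure(loads, failed_vm):
    """Reactive step: move the failed VM's work to the least-loaded healthy VM."""
    work = loads.pop(failed_vm)
    target = min(loads, key=loads.get)
    loads[target] += work
    return loads

def rebalance(loads, tolerance=1.2):
    """Hybrid step: shift work from the most- to the least-loaded VM until the
    ratio between them falls under the tolerance."""
    while max(loads.values()) > tolerance * min(loads.values()):
        busiest = max(loads, key=loads.get)
        idlest = min(loads, key=loads.get)
        shift = (loads[busiest] - loads[idlest]) / 2
        loads[busiest] -= shift
        loads[idlest] += shift
    return loads

loads = {"vm1": 40, "vm2": 35, "vm3": 10, "vm4": 55}
loads = handle_vm_failure(loads, "vm4")   # fault tolerance alone skews the load
print(loads)                              # {'vm1': 40, 'vm2': 35, 'vm3': 65}
print(rebalance(loads))                   # load is spread out again
```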

1.3 Our contribution and features of the study

The primary contributions of this survey include:

  • This article presents an in-depth examination of the cloud environment. The main faults and fault taxonomy in cloud systems are also discussed in detail.

  • Various researchers have already addressed fault tolerance and load balancing mechanisms; however, much of their work has focused on employing either fault tolerance or load balancing separately. The presented survey incorporates a review of fault tolerance with two other related aspects, i.e., load balancing and scheduling, which is a pressing need of the time and was found missing in current surveys.

  • Moreover, Tables 1 and 2 present a comparative analysis of our contribution with the recent and current top-cited studies respectively.

  • The survey has been presented in two categories i.e., Fault tolerance with Scheduling and Fault tolerance with Load balancing.

  • The generalized problem formulation of fault tolerance has also been presented to understand the workings of fault tolerance using the replication technique.

  • We further outlined the difficulties associated with ensuring fault tolerance integrated with scheduling and load balancing in cloud computing systems and provided a thorough examination of the common problems faced. This will assist future researchers in promptly recognizing and understanding the problems related to this area.

  • The study also presents useful graphical observations about the literature, such as the parameters optimized, the fault models addressed, the environment/tool used, etc. These detailed observations are presented separately for both categories and, to the best of our knowledge, were not found in the existing surveys. A dedicated discussion and observation section is designed for that purpose.

  • This hybrid review aids in investigating the potential challenges of hybrid fault-tolerant models and provides a detailed roadmap for future research directions. The aim is to enhance migration methods, thereby mitigating failures among nodes.

  • Moreover, the overall study provides a platform for future researchers to analyze the current state of the art regarding considered issues and find the appropriate future research problems.

  • In the end, there is a dedicated section highlighting the future research directions of the problem.

Table 1 Comparative analysis related to the contribution of the top-cited study and the proposed study
Table 2 Enlightenment of reactive fault-tolerant techniques

1.4 Organization of the paper

This research review article adheres to the following structure:

Section 1 presents the detailed introduction of the study, with subsections 1.1 to 1.7. These subsections cover the research methodology and data analysis, the motivation (with an illustrative example elucidating how hybrid frameworks can benefit CSPs), and the authors' contributions. Section 1 also focuses on the significance of fault tolerance in the cloud, encompassing a taxonomy of faults, errors, and failures, along with the challenges associated with fault tolerance, in dedicated subsections. It further delineates the specifics of scheduling, load balancing, and fault tolerance in the pursuit of reliable cloud services, and formulates the problem associated with fault tolerance in this context. The detailed survey of the literature with a comparative analysis is elaborated in Section 2. Section 3 presents the discussions and observations from the reviewed literature, with the overall analysis of fault tolerance with both scheduling and load balancing in dedicated sections. Open issues, future directions, and a methodical roadmap for open challenges are highlighted in Section 4. Finally, Section 5 concludes the whole study. The organization of the presented study is shown in Fig. 3.

Fig. 3. Organization of the study

1.5 Fault tolerance in cloud computing

Faults in any resource may affect the task execution time and the QoS parameters of the cloud, which eventually reduces the performance of the system. An efficient fault tolerance policy helps to identify and overcome errors in the cloud architecture, thereby boosting the performance metrics. The fault tolerance capability should be considered together with other techniques such as scheduling and load balancing for the effective performance of the system. Moreover, the load balancing and scheduling approaches should perform their respective functions alongside fault tolerance. In case of a crash or connection error, the system should be capable of providing an alternative VM to handle these failures for smooth and uninterrupted task execution, because crashes in any node will affect the efficiency of the entire system. Therefore, handling faults enhances the ability of the system to accomplish tasks precisely and accurately while resolving the occurrence of internal defects [38]. The inclusion of fault tolerance with other reliability-related techniques, such as scheduling and load balancing, will make the cloud environment more efficient, specifically for the real-time and dynamic processing of tasks [39]. Hence, fault tolerance is a major aspect that ensures robustness, reliability, and other performance metrics in the cloud environment [40, 41].

1.5.1 Fault, error, and failure taxonomies

A fault is the condition of the system in which it loses the ability to produce the expected output due to an unexpected condition or defect in any of its internal or external components. The main faults within the cloud environment are enumerated as follows [42]:

  • The Network Faults: These defects arise due to network interruption in any connection, nodes, cluster, etc., [43, 44].

  • The Physical Faults: When any of the hardware resources like CPU, memory, storage, etc., fails, these types of faults will occur. The power failure also gives rise to these types of faults [42].

  • The Process Faults: These are the common faults in a cloud environment that occur because of the unavailability of any resource, software, etc., [43].

  • The Service Expiry Fault: This type of fault arises if the service clock of the resource runs out while the application is in use [43].

  • The Media Fault: Any crash in the media of the cloud will lead to these types of faults [39].

  • The Processor Faults: This type of fault mainly occurs because of malfunctioning in the operating system [45].

  • The Restrictions Faults: This type of fault occurs when any fault arises and is unnoticed or ignored by the controlling or any other responsible agent [17].

  • The Parametric Faults: This type of fault occurs when the optimization parameters are ambiguous or remain undefined [17].

  • The Time Restriction Faults: These faults occur when the particular application is not completed by the predefined deadline [17].

  • The fault tolerance mechanism makes the cloud environment efficient by providing necessary services even in case of failure of one or multiple components [46, 47]. If there is any kind of fault in the system, it leads to error, and error, in turn, culminates in failure.

  • Fault: The abnormal state of a system in which assigned tasks cannot be performed. Usually, the fundamental cause of this state is the presence of bugs in one or multiple components of the system [26, 29, 30, 48, 49, 50]. Faults are categorized into various groups, as depicted in Fig. 4.

  • Error: A system experiencing faults may transition into an error state. Compromised performance due to errors can subsequently result in incomplete or complete failure of the system. Errors have been classified into the following categories, as shown in Fig. 5.

  • Failure: The presence of an error can take the system to the failure state, which has a direct effect on the user. Moreover, a failure is recognized by the user upon observing incorrect output from the system [25, 26, 30]. Failures have been classified into the following categories, as exhibited in Fig. 6.

Fig. 4. Different fault categories

Fig. 5. Different error categories

Fig. 6. Different failure categories

1.6 General fault-tolerance challenges in cloud computing

Ensuring a fault-tolerant cloud environment involves evaluating numerous challenges. Some of these challenges are discussed below:

  • Task and failure heterogeneity: The cloud utilizes different hardware and operating systems simultaneously and builds on underlying heterogeneous frameworks [51]. As a result, heterogeneous types of faults must be handled, which increases the complexity of overcoming them.

  • Automation: The use of VMs in the cloud environment is growing exponentially, and managing these platforms in real time is becoming more difficult. Therefore, there is a strong need to automate fault tolerance strategies for complex networks [15].

  • Cloud halts: The main goal of fault tolerance is to provide uninterrupted service even in the case of a service interruption or a malfunction of a host server or network system. The Service Level Agreements [26] of all companies should be prepared accordingly.

  • Retrieval Point and Recovery Time Objective targeting: The retrieval (recovery) point is established to preserve the set of records that may be at risk of loss in the event of a server error [14]. The recovery time, on the other hand, is the time required for the process to get back up and running after a failure [52]. The main aim is to reduce the RPO (Retrieval Point Objective) and RTO (Recovery Time Objective) to the minimum possible values [10].

  • Cloud Workload: Cloud workloads are the specific application-related tasks/services or specific amounts of work executed on a cloud resource. Workloads can be of two types, i.e., enabled and native loads. Native workloads are also labeled as “born on the web” and are entirely cloud-developed applications. An enabled workload, on the other hand, pertains to the computational tasks generated by cloud-enabled applications. Moreover, proactive and resilient approaches seem relevant [53] for fulfilling the fault tolerance requirements of both enabled and native workloads [10].

1.7 Measures for effective cloud reliability- a need for the hybrid framework

The demand for the cloud computing paradigm has grown intensely in the past few years, as it allows the dynamic acquisition and release of computing resources in a device-independent and cost-effective manner with little effort or interaction from the service provider. Despite many enhancements, the cloud is still prone to system failures, which results in growing apprehension regarding the reliability of public cloud services. Reliability is a measure of the effectiveness of the system; its value can be adjusted after each computation, with the default reliability being 100% [54]. The reliability conditions must be met for stable and efficient processing in the cloud. It is also one of the critical Quality of Service constraints. Moreover, optimized QoS parameters play an important role in effective and adequate resource allocation and have been extensively investigated in cloud computing. These parameters are used to evaluate the efficiency of various scheduling, load balancing, or fault tolerance techniques in the cloud.
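
As an illustration of how such a reliability value can be adjusted at run time (a generic, hedged sketch; the adjustment factors α and β are assumptions and are not taken from [54]), a simple adaptive scheme starts every VM at R₀ = 1 (i.e., 100%), increases its reliability after each successful execution, and decreases it after a failure:

$$R_{k+1}=\begin{cases}R_{k}+\alpha \left(1-R_{k}\right), & \text{if the } k\text{-th execution succeeds}\\ \beta R_{k}, & \text{if the } k\text{-th execution fails}\end{cases}\qquad 0<\alpha ,\beta <1$$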

1.7.1 Cloud scheduling approach

Cloud scheduling is performed by mapping each incoming task to the most suitable available VM. Determining the sequence in which events or tasks should be executed in the cloud, while simultaneously analyzing the required QoS parameters, is termed scheduling. Cloud scheduling mainly includes the following:

  • Prediction of future incoming workloads and normalization of the QoS parameters.

  • Selection of the most suitable VM and execution of the particular task via heuristic/meta-heuristic algorithms.

Generally, the VM/task scheduling is done in two ways:

  • On-Demand Scheduling: This scheduling considers dynamic cloud workloads on demand, and VMs are provided quickly by cloud service providers as required. However, it may lead to the problem of workload dispersal; in other words, multiple tasks may be processed by a single VM at a time (an under-provisioning problem), degrading the performance of the system.

  • Long-Term Reservation: This scheduling reserves resources for the long term. However, providing many VMs can lead to over-provisioning problems in some situations.

These under- and over-provisioning problems may cause wastage of VMs and increased task execution time, and thereby the overall cost of services may increase. Hence, a well-organized and effective provisioning technique that examines and schedules cloud workloads efficiently is essential. Figure 7 explains the process of VM Provisioning and Scheduling (VPS) [55].

Fig. 7. VM provisioning and scheduling (VPS)

The main aims of VM provisioning are:

  • Fulfill the User’s demand without SLA violation.

  • Prior prediction of user requirements based on incoming workload size.

In cloud provisioning, the SLA is settled between the end users and the Cloud Service Provider after fully analyzing the incoming workloads. Before scheduling (mapping) the incoming workload (applications/tasks) onto particular VMs/resources, the running VMs are monitored regularly for load estimation [56]. If a VM is found to be overutilized, that particular VM is temporarily disabled for future assignments and is not allocated immediately after mapping. Afterward, the task-executing capability of the VM is also tested before any further allocation. This study also contains a review of various research papers focusing on the principles of load balancing and scheduling. In the cloud, efficient scheduling of jobs is the main factor ensuring high-performance applications. However, cloud scheduling not only has to deal with the dynamism and the widespread nature of the cloud, but it should also consider the optimization of other important parameters. Matching tasks to the corresponding machines and scheduling the order of execution of these tasks is referred to as mapping. Efficient mapping minimizes the total execution time of the meta-task. A meta-task is defined as a collection of independent tasks with no inter-task dependencies. The mapping of such meta-tasks is performed statically (i.e., offline or in a predictive manner). The general problem of optimally mapping tasks to machines is NP-complete [57]. Task scheduling [58] is the fundamental step of VM management in the cloud. Task scheduling can be of two types: static and dynamic scheduling. A minimal sketch of a greedy mapping heuristic of this kind is given below.
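
The following Python sketch illustrates a simple minimum-completion-time (MCT) style greedy mapping of an independent meta-task set onto VMs. It is only an illustrative heuristic, not the exact algorithm of any surveyed work; task lengths in million instructions (MI) and VM speeds in MIPS are assumed units.

```python
# Minimal sketch: greedy minimum-completion-time (MCT) mapping of a meta-task.
# Assumptions: task lengths in million instructions (MI), VM speeds in MIPS.

def mct_schedule(task_lengths, vm_speeds):
    """Map each independent task to the VM that can finish it earliest."""
    ready_time = [0.0] * len(vm_speeds)        # when each VM becomes free
    mapping = []                               # (task index, vm index, finish time)
    for i, length in enumerate(task_lengths):
        # expected completion time of task i on every VM
        finish = [ready_time[j] + length / vm_speeds[j] for j in range(len(vm_speeds))]
        best_vm = min(range(len(vm_speeds)), key=lambda j: finish[j])
        ready_time[best_vm] = finish[best_vm]
        mapping.append((i, best_vm, finish[best_vm]))
    makespan = max(ready_time)                 # total execution time of the meta-task
    return mapping, makespan

if __name__ == "__main__":
    tasks = [4000, 1200, 2600, 800, 5400]      # MI
    vms = [1000, 500, 2000]                    # MIPS
    plan, makespan = mct_schedule(tasks, vms)
    for t, v, f in plan:
        print(f"task {t} -> VM {v}, finishes at {f:.2f} s")
    print(f"makespan: {makespan:.2f} s")
```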

1.7.2 Load balancing approaches

Load balancing is among the chief requirements of a cloud environment. Load balancing usually shifts load from a highly loaded VM to a minimally loaded VM to ensure the uniform dispersal of load among VMs. It aims to share the workload among computational resources to maintain load equilibrium and allow each resource to function within its designated efficiency threshold. The uneven distribution of load among VMs affects the response time, interaction overhead, throughput, and resource utilization of the system [31]. Furthermore, load balancing improves VM availability and maintains reliability. Besides, the load can be balanced by implementing resource redundancy, which supports scalability. Numerous strategies have been proposed by researchers to attain optimal load balancing. Some of the advantages that inspire the implementation of load balancing in the cloud are as follows:

  • Efficient VM Utilization in a Cloud Environment: In the cloud, VMs may be loaded unevenly, which affects the overall performance of the system. A frequently selected VM can be highly utilized while other VMs remain idle throughout the process, with the underutilized VMs waiting for tasks. This scenario results in higher processing time and increased waiting time. To overcome such inconsistencies, VM utilization needs to be made efficient by optimally balancing the load among resources.

  • Adequate Load Distribution: Adequate load distribution is necessary to attain the best possible performance of the system. It makes use of the maximum computing capability of each VM and enables parallel task execution. Likewise, it ensures that an adequate load is allocated to every single VM according to its capacity in all conditions. It is necessary to distribute the workload among all VMs uniformly, according to their processing capacities, to reduce the task execution time to the lowest possible value.

  • Minimization of Response Time: Inappropriate load distribution leads to several disparities resulting in higher response time which eventually results in an inconsistent state of the system. Thus, it is crucial to realize optimal load balancing to minimize the response time and achieve enhanced system throughput.

Besides, in the cloud, VMs can work independently or collectively as per the requirement and nature of the task. Each VM is capable of processing workload according to its processing capabilities. The prime target of load balancing is to achieve a balanced distribution of workloads among the available VMs. Typically, load-balancing algorithms comprise two elementary policies, i.e., the transfer policy and the location policy [59]. The transfer policy identifies whether a VM is overloaded or not; the dynamic aspects of the system are also addressed by this policy. The transfer policy also decides whether load migration needs to be initiated. Based on workload information, this policy determines when a node should act as a sender (i.e., transfer a task to another VM) and when it should act as a receiver (i.e., retrieve a task from another VM).

The location policy, in turn, selects a suitable under-loaded or over-loaded partner VM. It locates corresponding VMs and allows them to send or receive workload between them to improve the overall performance of the system. These policies are further categorized as receiver-initiated, sender-initiated, or symmetrically initiated. The location policy chooses an alternative VM for task migration: if a VM is identified as a qualified receiver, the policy searches for a qualified sender VM, and vice versa. Once a virtual machine becomes eligible as a sender or receiver, a selection policy determines which job in the queue should be moved first [31]. A minimal sketch of the transfer and location policies is given at the end of this subsection. Based on the information and implementation used by these two policies, load balancing mechanisms are classified as follows [60]:

  • Static Load Balancing

  • Dynamic Load Balancing

  • Adaptive Load Balancing

  • Periodic Load Balancing

  • Non-Periodic Load Balancing

  • Advance Load Balancing

Generally speaking, load-balancing algorithms can also be categorized as hierarchical, decentralized, or centralized, depending on where migration decisions are made [61, 62].
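
The following Python sketch illustrates the transfer and location policies described above using simple utilization thresholds. It is an illustrative toy model; the thresholds, load metric, and sender-initiated pairing rule are assumptions, not taken from any surveyed scheme.

```python
# Minimal sketch of threshold-based transfer and location policies.
# Load is modeled as CPU utilization in [0, 1]; thresholds are illustrative.

OVERLOAD_THRESHOLD = 0.80
UNDERLOAD_THRESHOLD = 0.20

def transfer_policy(vm_load):
    """Decide whether a VM should act as a sender, a receiver, or stay as-is."""
    if vm_load > OVERLOAD_THRESHOLD:
        return "sender"        # too loaded: should migrate a task away
    if vm_load < UNDERLOAD_THRESHOLD:
        return "receiver"      # lightly loaded: can accept extra work
    return "balanced"

def location_policy(roles):
    """Pair every sender with a receiver, busiest sender with idlest receiver."""
    senders = sorted((v for v in roles if roles[v][0] == "sender"),
                     key=lambda v: -roles[v][1])
    receivers = sorted((v for v in roles if roles[v][0] == "receiver"),
                       key=lambda v: roles[v][1])
    return list(zip(senders, receivers))       # (from_vm, to_vm) migration pairs

if __name__ == "__main__":
    loads = {"vm1": 0.92, "vm2": 0.10, "vm3": 0.55, "vm4": 0.85, "vm5": 0.05}
    roles = {vm: (transfer_policy(load), load) for vm, load in loads.items()}
    print(location_policy(roles))   # [('vm1', 'vm5'), ('vm4', 'vm2')]
```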

1.7.3 Fault-tolerant approaches

The cloud is a dynamic system that hosts several dispersed, heterogeneous resources (VMs) and completes millions of user tasks. Nevertheless, these VMs have the flexibility to join or exit the system at any given time. Thus, achieving fault tolerance is a critical issue in such dynamic systems [63]. Additionally, the implementation of a fault-tolerant system also leads to the optimization of various QoS parameters and cloud characteristics, so significant benefits can be attained. It also assures on-time task execution in the case of unexpected scenarios such as failures, resource disconnection from the system, task migration, or any other unanticipated user operation. Moreover, while numerous previous studies have tackled fault tolerance and task allocation, only a limited number have examined issues at the processor level. In the recent literature, a handful of works have delved into extensive research on scheduling and load balancing while incorporating fault tolerance [17]. The concept of abstraction is split into different layers, i.e., the Infrastructure as a Service, Platform as a Service, and Software as a Service layers. There is a need to implement appropriate fault tolerance techniques for fault diagnosis to determine the various faults at these service levels. This research article includes various fault diagnosis methods corresponding to these service layers, along with the fault categories. Defects in any layer can have an impact on the layer above it because of the layer interrelationships [17].

Moreover, to reach higher levels of robustness in cloud computing, failures need to be assessed and handled effectively [26, 29, 48, 64]. Extensive work has been proposed in the literature to make the cloud fault-proof. The approaches proposed in the literature can be categorized as shown in Fig. 8.

Fig. 8. Categories of fault tolerance techniques under different approaches

Reactive fault tolerance

Once a defect has occurred, reactive fault tolerance is applied. Using this approach, we can decrease the impact of the fault in the cloud and thereby increase the system's robustness and reliability [46, 48]. The focus is on recovering the system in case of a failure inside it [10]. Furthermore, data replication and data transfer are used for restoration [65]. These approaches address Byzantine faults, crash faults, hardware faults, and host failures. Different fault-tolerant techniques that utilize a reactive approach are presented in Table 2.
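
As a small illustration of the checkpoint/restart idea listed among these reactive techniques (a generic, simplified sketch; the checkpoint interval, file name, and state format are assumptions, not tied to any surveyed scheme):

```python
import pickle

def run_with_checkpoints(work_items, state_file="task_state.pkl", every=100):
    """Process work items, saving progress periodically so that execution can
    resume from the last checkpoint instead of restarting after a crash."""
    try:
        with open(state_file, "rb") as f:
            start, partial_sum = pickle.load(f)        # resume from checkpoint
    except FileNotFoundError:
        start, partial_sum = 0, 0                       # fresh run
    for i in range(start, len(work_items)):
        partial_sum += work_items[i]                    # the "work" being done
        if (i + 1) % every == 0:
            with open(state_file, "wb") as f:
                pickle.dump((i + 1, partial_sum), f)    # checkpoint progress
    return partial_sum

if __name__ == "__main__":
    print(run_with_checkpoints(list(range(1000))))      # 499500
```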

Proactive fault-tolerance

This strategy provides pre-planned alternative solutions for handling faults; fault prediction is therefore proactive. Moreover, the faulty component is substituted with an alternative component at runtime to avoid having to recover from errors and faults [4, 46, 47, 66]. This approach provides cost effectiveness with maximum efficiency and reliability of the system [27] and addresses software and parametric faults. Some of the proactive fault-tolerant techniques proposed in the literature are listed in Table 3.

Table 3 Enlightenment of proactive fault-tolerant techniques

Resilient fault-tolerance

These techniques have some similarities with the proactive approach. The defects are forecasted, and their effects are prevented or moderated by applying certain methodologies. The forecasting utilizes intelligent learning, which distinguishes resilient techniques from proactive ones. These approaches are adopted for general faults. In this strategy, the system is continuously monitored for faults, which makes it an adaptive form of fault tolerance [10]. Some of the resilient fault-tolerant techniques proposed in the literature are presented in Table 4.

Table 4 Enlightenment of resilient fault-tolerant techniques

In general, the reactive strategy does not require any preventive mechanism in the system until a fault occurs; efforts are made to moderate the harmful effects after faults are detected [74]. In a proactive strategy, the system is continuously tracked to analyze faults and eliminate them before they appear; the system state is continuously monitored to predict fault occurrence in advance. In resilient strategies, the system operates even in the presence of faults, and the faults are removed within a given timeframe. The respective pros and cons of these strategies are presented in Tables 5 and 6.

Table 5 Pros of fault-tolerant strategies
Table 6 Cons of fault-tolerant strategies

1.7.4 General problem formulation for fault tolerance using replication

Problem Statement: A problem formulation that focuses on the importance of fault tolerance in the context of the cloud.

Problem Scope: Fault tolerance in the cloud is addressed to provide continuous service delivery even in the event of failures or breakdowns.

Objectives: The main goal is to reduce fault-related service interruptions and downtime in order to maximize cloud service availability. Additionally, increasing resource utilization, minimizing data loss, and maintaining SLA thresholds are also included in the formulation.

Problem Constraints: The required efficiency of services must be guaranteed, the fault tolerance techniques should add as little overhead as possible, and the solution should be applicable to the related computational resources.

Parameters: The parameters manipulated during fault tolerance are MTTF (Mean Time To Failure), MTBF (Mean Time Between Failures), MTTR (Mean Time To Repair), etc. The parameters that are optimized are the average resource utilization, makespan, recovery rate, failure rate, success rate, etc. There can also be decision parameters in fault tolerance, such as the selection of alternative resources, the fault detection algorithm, the recovery mechanism, etc.
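
For reference, these timing parameters are connected by the standard reliability-engineering relations below (general textbook identities, not specific to any surveyed work), which also give the steady-state availability that many of the optimized parameters ultimately depend on:

$$MTBF=MTTF+MTTR,\qquad Availability=\frac{MTTF}{MTTF+MTTR}$$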

Problem Formulation: For fault tolerance in real-time systems, two important sets can be considered, i.e., the task set (T) and the VM set (V). T = {t1, t2, …, tn} denotes the n real-time tasks present at any instant in the cloud environment. Each real-time task {ti | ti ∈ T} has a set of attributes associated with it, such as arrival time, dimensions, expected execution time, anticipated finish time, anticipated harvest time, deadline limit, etc. The deadline and harvest time can be related to each other as follows:

$$Exp\; HT=D-Min\; PT$$

where D denotes the task deadline and Min PT the minimum processing time of the task.

V = {v1, v2, …, vm} denotes the m accessible VMs in the cloud environment.

Each accessible VM {vi | vi ∈ V} has a set of attributes associated with it, such as vm_id, capacity, cluster, etc.

Fault tolerance can be achieved by using any of the fault-tolerant approaches; here, we utilize the replication technique. The scheduler should therefore possess the capability to generate the required number of replicas separately for every real-time task.

For each {ti | ti ∈ T}

Enable the scheduler to generate replicas

Allocate a VM to each replica,

Calculate the expected finish time Fi,j,k of a given replica using the following equation:

$${\text{F}}_{\text{i},\text{j},\text{k}}=\text{A}\left({\text{t}}_{\text{i}}\right)+\text{w}\left({\text{r}}_{\text{i}}\right)+\text{e}\left({\text{r}}_{\text{i},\text{j},\text{k}}\right)$$

where i, j, and k represent the index of the original real-time task, the index of the current replica, and the index of the allotted VM, respectively. A is the arrival time of the real-time task, w is the waiting time of the replica, and e is the expected execution time of the replica on the allotted VM.

Further, e(ri,j,k) is computed by the following equation:

$$e\left({r}_{i,j,k}\right)=\frac{task\;dimensions}{computational\;power\;of\;allotted\;VM}$$

After e(ri,j,k) elapses, the following condition is evaluated for every real-time task.

If replica(ti) = failed

Mark ti “failed”

Else Mark ti “Succeeded”

Additionally, a reservation mechanism can also be used to achieve fault tolerance, in which a VM is reserved in advance and allocated in case of a fault. A minimal code sketch of the replication-based formulation above is given below.
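
The following Python sketch is a minimal, illustrative implementation of the replication-based formulation above. The replica count, the attribute names, and the randomized failure simulation are assumptions introduced for illustration; they are not part of the formal model.

```python
import random
from dataclasses import dataclass

@dataclass
class VM:
    vm_id: int
    capacity: float                  # computational power (e.g., MIPS)
    ready_time: float = 0.0          # when the VM becomes free

@dataclass
class Task:
    task_id: int
    arrival: float                   # A(t_i)
    dimensions: float                # task size (e.g., MI)
    deadline: float

def expected_finish(task, vm):
    """F_{i,j,k} = A(t_i) + w(r_i) + e(r_{i,j,k})."""
    wait = max(0.0, vm.ready_time - task.arrival)         # w(r_i)
    exec_time = task.dimensions / vm.capacity              # e(r_{i,j,k})
    return task.arrival + wait + exec_time

def schedule_with_replication(tasks, vms, replicas=2, fail_prob=0.1):
    """Run each task on `replicas` distinct VMs; a task succeeds if any replica
    finishes before the deadline without a (simulated) fault."""
    results = {}
    for t in tasks:
        # choose the replicas with the earliest expected finish times
        chosen = sorted(vms, key=lambda v: expected_finish(t, v))[:replicas]
        succeeded = False
        for vm in chosen:
            finish = expected_finish(t, vm)
            vm.ready_time = finish
            replica_failed = random.random() < fail_prob   # simulated fault
            if not replica_failed and finish <= t.deadline:
                succeeded = True
        results[t.task_id] = "Succeeded" if succeeded else "Failed"
    return results

if __name__ == "__main__":
    vms = [VM(0, 1000.0), VM(1, 500.0), VM(2, 2000.0)]
    tasks = [Task(0, 0.0, 4000.0, 8.0), Task(1, 1.0, 1000.0, 3.0)]
    print(schedule_with_replication(tasks, vms))
```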

Estimation Metrics: This comprises the estimation of optimization parameters such as recovery time, achieved reliability, and effectiveness of resource use under both faulty and normal operating conditions.

2 Related literature

The advancement of cloud computing technology has transformed the way computing resources are provisioned, utilized, and managed. Cloud computing offers a vast array of services that are flexible, scalable, and cost-effective. To improve the utilization of cloud resources, various dynamic resource allocation algorithms have been proposed in the literature. However, ensuring fault-tolerant scheduling and load balancing is a critical challenge that needs to be addressed to provide uninterrupted services in the cloud. Virtual machine reservation is one of the promising approaches that can mitigate these challenges by allocating reserved resources for fault tolerance and load balancing.

2.1 Scheduling with fault-tolerance

Efficient scheduling in the cloud provides optimization of various Quality of Service parameters, especially task completion time. Besides, scalability, availability, security, and fault tolerance are key features of cloud services. Instead of causing a complete breakdown of the system, faults in the cloud may lead only to performance degradation. Without fault-tolerant scheduling, when one or more components of the system fail, the task execution time, waiting time, response time, etc., may increase, and throughput degrades as well. Fault tolerance, however, provides an alternative way to complete the process even if some of the resources are not working properly [46, 64]. A few works in the literature have proposed fault-tolerant scheduling algorithms with optimized parameters. Recently, in [75], the Dynamic Clustering Cuckoo Whale Optimization Algorithm (DCCWOA) was suggested to support effective fault-tolerant scheduling in the cloud. The algorithm was tested by varying the tasks between 100 and 1000 with 8 virtual machines. The problem of fault tolerance was also investigated in [76], and a greedy-based best fit decreasing (GBFD) algorithm was proposed to increase the success rate of task execution along with the optimization of other parameters. The model was evaluated with various loads from the PUMA datasets. Additionally, the computational complexity was claimed to be O(nm), where n is the number of VMs in the data center and m represents the computing nodes. In [77], the authors proposed GWO (Grey Wolf Optimization)-based task scheduling, evaluated on a 1000 MI task dataset. Fault handling is carried out in the proposed work through efficient task scheduling by employing the task resubmission technique. Extending this chain of work and addressing the problem of dependability relationships, learning automata were used and a self-adapting scheduling strategy, namely ADATSA, was proposed in [78]. The model was experimentally evaluated on 53 servers with 3 master nodes and 50 slaves. The complexity was reported to be O(NK) + O(MS), where N represents the cluster nodes, K the resource categories, M the average number of tasks on a node, and S the average number of state transitions. In [79], a Fault-Tolerant Hybrid Resource Allocation Model (FTHRM) was recommended, which ensures fault tolerance and minimizes the turnaround time (TAT). The proposed model employs a prior reservation process to distribute resources to the respective tasks, ensuring guaranteed task execution. Resource reservation is also enabled for time slots, with resources organized as required by the task set while accommodating VM heterogeneity. In case of resource failure, alternative resources are supplied, where the most preferred resource is the one with the least prior workload and the smallest execution time. The authors in [80] presented a framework for adaptive scheduling and fragmentation of tasks, namely Workflow Scheduling with Adaptable and Dynamic Fragmentation (WSADF), which first creates fragments according to the number of VMs in the fragmentation phase, and then, in the scheduling phase, selects the VMs so as to reduce bandwidth usage. WSADF was evaluated on workloads ranging from 25 to 1000 tasks and VMs ranging from 5 to 25. To make task scheduling adaptable to both heterogeneous and homogeneous environments, CPSO and FIPS were proposed in [81]. The proposed task scheduling was evaluated on 30 servers under 1000 iterations.
Continuing this line of work, to integrate localized edge clouds with publicly accessible clouds and enhance scheduling effectiveness and scalability, a hierarchy-based edge cloud concept was introduced in [82]. Additionally, FTDS, a failure rescue technique, was suggested to address the failures that arise while mobile applications are being executed. For evaluation, the number of workflow applications was varied from 10 to 70 and the workflow length from 10 to 60. Besides, some SLA (Service Level Agreement) parameters, such as CPU requirement, system bandwidth, and memory, need to be considered alongside appropriate scheduling. In this regard, a pre-emption-based algorithm was proposed in [83], which pre-empts resources from low-priority tasks for high-priority tasks when resources are unavailable and provides resource reservation reflecting numerous SLA parameters for service deployment. The evaluations were carried out via 4 cloud simulations, performing 10 consecutive runs with 60 requests having 10 to 15 subtasks each. The cost and deadline of the tasks are used to define task priority. Moreover, it provides dynamic resource provisioning and an effective fault tolerance process. In the same vein, a fault-tolerance-aware task scheduling scheme, namely the Checkpointed League Championship Algorithm (CPLCA), was proposed in [84]. This algorithm provides fault tolerance using a checkpointing strategy along with task migration and was evaluated using workloads in the Standard Workload Format accessible via the San Diego Supercomputer Center (SDSC). Efficient scheduling and fault handling together can ensure task execution and thereby satisfy the real-time requirements of the cloud. However, heterogeneous systems and their complexities are increasing dramatically, leading to failures. These failures can be mitigated by implementing efficient scheduling approaches. Therefore, the task scheduling problem on heterogeneous systems was addressed in [85]. Since this is an NP-hard problem, a heuristic algorithm, the Deadline Based Scheduling Algorithm (DBSA), was proposed to resolve it. The DBSA approach dynamically estimates the number of permanent failures that can be tolerated by first calculating the makespan for a fixed number of tolerated failures and then repeatedly comparing the makespan with the specified deadline to obtain the next tolerable number of failures. The model was evaluated on workloads ranging from 20 to 100 tasks with 4 and 8 VMs. Gaussian Elimination, Fast Fourier Transformation, and Molecular Dynamics Code were used as application graphs for testing. Finally, each task is mapped to the appropriate processor without violating precedence constraints. Further, in [86], the cost-effective NNCA_PSO was proposed by modifying Particle Swarm Optimization (PSO). During the evaluations, the workload was varied from 70 to 560 tasks and 4 to 8 VMs were used. Furthermore, the Advance Reservation Fault Tolerance Model (ARFTM) was proposed in [87], which maps the tasks using MCT and tolerates faults using the advance reservation technique. ARFTM was evaluated by varying the workload from 1 to 300. A minimal, generic sketch of the task-resubmission style of fault handling mentioned above is given below.
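
The sketch below illustrates resubmission-style fault handling in general terms: a failed task is simply rescheduled on an alternative VM until it succeeds or a retry budget is exhausted. The retry limit, the randomized failure simulation, and the VM selection rule are assumptions for illustration and are not taken from any surveyed algorithm.

```python
import random

def execute_on_vm(task_id, vm_id, fail_prob=0.2):
    """Simulated task execution; failure is randomized purely for illustration."""
    return random.random() >= fail_prob

def resubmission_scheduler(tasks, vms, max_retries=3):
    """Resubmit a failed task to an alternative VM until it succeeds or the
    retry budget is exhausted."""
    status = {}
    for task_id in tasks:
        attempt, done = 0, False
        while attempt <= max_retries and not done:
            vm_id = vms[(task_id + attempt) % len(vms)]   # naive alternative-VM choice
            if execute_on_vm(task_id, vm_id):
                status[task_id] = f"succeeded on {vm_id} (attempt {attempt + 1})"
                done = True
            attempt += 1
        if not done:
            status[task_id] = "failed after all retries"
    return status

if __name__ == "__main__":
    print(resubmission_scheduler(tasks=range(5), vms=["vm0", "vm1", "vm2"]))
```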

In [88], the facts that network bandwidth is limited and that scheduling policies should reduce bandwidth usage in cloud computing were considered. The authors propose a data-locality-based task scheduling approach, i.e., the Balance Reduce Algorithm (BAR). It reduces network access and thereby bandwidth usage and job completion time, although the type and nature of the workload used for evaluation were not specified. Furthermore, an improved Balance Reduce Algorithm was proposed with better handling of machine failures. Later, in [41], a fault-tolerance-based scheduling approach, namely the Dynamic Clustering League Championship Algorithm (DCLCA), was proposed to reduce premature task failures. The model was evaluated in two scenarios: in the first, a parallel workload archive containing 73,496 tasks in the Standard Workload Format, accessible via the San Diego Supercomputer Center (SDSC), was used; in the second, workloads were generated from CloudSim's PlanetLab workload. All the surveyed methods are summarized in Table 7.

Table 7 Comparative analysis of recent scheduling-based fault tolerance algorithms

Scheduling and fault tolerance frameworks

Various scheduling and fault tolerance frameworks have been proposed in the literature. In this section, these frameworks are surveyed and presented. Comparative analyses of the different scheduling and fault tolerance frameworks are presented in Table 8.

Table 8 Comparative analysis of various Fault tolerance and scheduling frameworks

2.1.1 Proactive-based scheduling and fault tolerance framework

In this approach, the system can handle any disruptions or interruptions. The state of the system is monitored continuously for breakdowns and failure. Some of the proactive-based scheduling and fault tolerance frameworks found in the literature are mentioned below:

  • SHelp [91]: This approach was proposed as an improvement of the existing ASSURE framework [100], which operates at rescue points. ASSURE searches for a rescue point by traversing the rescue-trace graph, while in SHelp each rescue point is assigned a zero-initialized weight. The weight of a rescue point increases proportionally each time that rescue point is applied. Whenever a fault occurs, the rescue points are searched in decreasing order of their weights (a simplified sketch of this weighted selection is given after this list).

  • PFHC [92]: This is a proactive fault tolerance framework proposed for HPC (High-Performance Computing) applications in the cloud. The framework works on three chief modules. The Node Monitoring Module is equipped with special lm-sensors [101, 102] to periodically monitor several health parameters such as fan speed, CPU temperature, etc. The Fault Tolerance Module comes into action when faults occur; it is responsible for providing the resource provider with the information needed to obtain an alternative resource and for migrating the requests to the new resource. The Controller Module is installed at every node; it implements the fault-tolerant policy and is also responsible for the real-time migration of VMs.

  • WSRC [93]: This framework contains a module, namely a failure detector, that periodically checks the Virtual Machine Manager (VMM) for any kind of variation, such as a delay in response time or mismanagement of memory. If any fluctuation is found, the running status of the VMs is saved and the VMM is repaired using the rejuvenation technique. Rejuvenation generally leads to high overheads; however, WSRC uses variable-time rejuvenation to keep these overheads under control.

  • SRFSC [94]: The software rejuvenation technique is used in this framework, which primarily works in three phases. In the first phase, a packet containing information about the CPU and memory usage of the VM is received by the Aging Failure Detection component. The next step, known as Aging Degree Evaluation, assesses how close the VM is to failure based on two main aspects: CPU/memory usage and packet arrival, i.e., whether the packet arrived before/after expectations or has been lost. In the third and final step, the rejuvenation decision is taken: the VMs are migrated to another native VM and the original VM is rebooted.

  • FTDG [95]: FTDG is a fault-tolerant framework in which pre-emptive relocation is performed. The architecture of this framework mainly comprises four functional spaces. The User Space is used by users to submit their data flows. The Graph Space transforms the submitted user data into Directed Acyclic Graphs (DAGs); the DAG is then analyzed for critical and non-critical paths. In the Storm Space, scheduling and fault tolerance mechanisms are applied. The Hardware Space contains the various data center resources.
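
As an illustration of the weighted rescue-point selection described for SHelp above, the following simplified, hypothetical sketch keeps a per-rescue-point counter and tries the most frequently useful points first; the data structures and names are assumptions and are not taken from the SHelp implementation.

```python
# Simplified sketch of SHelp-style weighted rescue-point selection.
# Each rescue point starts with weight 0; a point's weight grows each time it
# is applied, so frequently useful rescue points are tried first.

class RescuePointTable:
    def __init__(self, rescue_points):
        self.weights = {rp: 0 for rp in rescue_points}   # zero-initialized weights

    def candidates(self):
        """Return rescue points in decreasing order of weight."""
        return sorted(self.weights, key=self.weights.get, reverse=True)

    def record_applied(self, rescue_point):
        """Increase the weight of a rescue point that handled a fault."""
        self.weights[rescue_point] += 1

table = RescuePointTable(["rp_parse", "rp_io", "rp_net"])
table.record_applied("rp_io")
table.record_applied("rp_io")
table.record_applied("rp_net")
print(table.candidates())   # ['rp_io', 'rp_net', 'rp_parse']
```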

2.1.2 Reactive-based scheduling and fault tolerance frameworks

In such frameworks, the faults are handled once they occur. Unlike proactive approaches, monitoring of system behavior is not required in such frameworks. Some of the Reactive-based scheduling and fault tolerance frameworks found in the literature are mentioned below.

  • AFTRC [96]: In AFTRC (Adaptive Fault Tolerance in Real-time Cloud Computing), the received tasks are held in an input buffer and task execution is accomplished on a first-come, first-served basis. This model also consists of several other modules. The Acceptance Test (AT) module checks the results of each embedded algorithm for accuracy and verifies the results. The Time Checker (TC) checks whether the result is obtained within the deadline or not; if not, the specific task is sent back to the input buffer. The Reliability Assessor (RA) adjusts the reliabilities of the VMs based on the obtained results. The Decision Mechanism (DM) selects the output from the most reliable node.

  • BFTCloud [97]: This framework uses replication techniques and completes user requests on time even in the presence of faults. The number of replicas/nodes is determined by considering the failure probability of all nodes; the failure likelihood of the replica group should always remain below the top-level failure likelihood. The BFTCloud framework mainly works in five phases. Primary Selection: in this phase, the primary node is designated based on a rating obtained by adding the priority weight and the QoS value assigned to each node; the node with the highest rating is chosen as the primary node. Replica Selection: in this phase, the replicas are selected by observing the QoS of every node from the viewpoint of both the primary node and the cloud module; the new QoS is calculated, and the rating is performed again. Request Execution: this phase allows the nodes to complete the request and respond to the cloud module accordingly. The cloud, in turn, checks the consistency of the obtained results based on different cases [17]; if the results are consistent, then the primary replica is assigned to the next request. Primary Updating: in case of a fault in the primary replica, this phase informs all other replicas to select an alternative. Replica Updating: this phase removes the faulty replica and adds new nodes to decrease the failure probability.

  • FESTAL [98]: FESTAL is a fault-tolerant scheduling framework in which the primary-backup technique is realized to handle faults. In this framework, the user tasks are queued in an input buffer and assigned to the scheduler, which has three controllers, i.e., Resource Controller, Backup Copy Controller, and Real-time Controller.

The Backup Copy Controller creates the backup copy. Afterward, the Resource Controller searches for two VMs that can complete the task within the deadline. Based on the search results, two decisions can be made.

  • In case the two corresponding VMs are found, both task instances are scheduled on the respective VMs.

  • In case no such VM is found, the task is rejected.

In this framework, if the anticipated end time is less than or equal to the task deadline, a passive backup is utilized; otherwise, an active backup is employed, as sketched below.
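
A minimal sketch of this decision rule follows, under the assumption that the "anticipated end time" refers to the finish time of a backup that starts only after a primary failure; the numbers are illustrative.

```python
def choose_backup_mode(deadline: float, lazy_backup_finish: float) -> str:
    """Decision sketch following the rule stated above: if a backup that only
    starts after a primary failure (passive) can still finish by the deadline,
    use it; otherwise run the backup actively alongside the primary."""
    return "passive backup" if lazy_backup_finish <= deadline else "active backup"

print(choose_backup_mode(deadline=10.0, lazy_backup_finish=9.0))   # passive backup
print(choose_backup_mode(deadline=10.0, lazy_backup_finish=12.0))  # active backup
```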

2.1.3 Resilient-based scheduling and fault tolerance frameworks

These techniques have some similarities to the proactive approach: defects are forecasted, and their effects are prevented or mitigated by applying suitable methodologies. However, the forecasting relies on intelligent learning, which is what distinguishes resilient techniques from proactive ones. Compared to conventional fault tolerance techniques, resilient fault tolerance provides increased durability and adaptability in the event of system breakdowns. Some of the advantages of resilient fault tolerance over traditional fault tolerance are:

  • Dynamic environment: Resilient systems can bounce back from errors without sacrificing functionality because they dynamically adjust to shifting circumstances; they are designed to respond quickly to changing threats and difficulties. Conventional fault tolerance techniques, by contrast, may find it difficult to adjust to sudden or rapid shifts in the environment, and since they frequently rely on predetermined rules, they may not react as well to new kinds of errors.

  • Recovery: Resilient systems often include automated recovery mechanisms capable of promptly detecting and fixing errors without human interaction, which reduces the effect on coordinated functions and decreases downtime. Traditional approaches, on the other hand, may require more manual intervention to recover from errors, resulting in longer recovery times and a higher chance of service interruption.

  • Real-time tracking and reporting: Sophisticated analytics and tracking techniques, which offer practical observations of the system's performance, are frequently integrated into resilient systems; they also enable active defect identification and prevention. Conventional approaches, in contrast, may be less successful at locating and addressing errors because they depend on periodic checks or event-triggered reactions.

  • Optimization: Resilient systems maximize the use of the available resources during fault recovery, guaranteeing that resources are distributed effectively to sustain critical operations. Traditional techniques, by comparison, may rely on expensive strategies, which can result in more inefficiency and lower overall system effectiveness.

  • Flexibility and adaptability: Resilient designs frequently display improved adaptability and flexibility, enabling them to adjust to changing demands and scale resources up or down in response to consumption. Traditional approaches may struggle to adjust dynamically or to regulate shifting demands, which can result in inefficiencies during times of high consumption.

To sum up, resilient fault tolerance provides a more systematic and flexible approach to addressing failures, allowing systems to recover swiftly and efficiently in dynamic contexts. Compared with conventional fault tolerance techniques, this strategy frequently results in increased overall performance, fewer interruptions, and enhanced system efficiency. In this context, EFTT (Efficient Fault Tolerance Technique) is a type of resilient-based approach. In [99], the authors used machine learning to handle faults and generate fault tolerance solutions; ML was, nevertheless, applied only as a sub-component of the overall FT solution. Some solutions have employed ML intensively to forecast faults from a set of specified variables, and many applications have combined ML with the handling of hardware faults. Here, artificial intelligence, or machine learning, is used to create a system that can operate autonomously, like a human, without the need for human intervention. Machine learning procedures can be used to increase a system's reliability even in the case of fault tolerance; such fault tolerance techniques are known as Resilient Fault-Tolerant Techniques, as discussed in Section 5.3. Machine learning techniques are typically used in proactive roles, predicting failures before they happen from historical system data (sketched after this paragraph). Resilient fault tolerance techniques are the emerging ones because ML can access data and even learn from it. One such learning method, reinforcement learning, was applied in [103] to study the fitness of VMs in cloud environments; with this type of learning, every VM participates in the learning process independently. As recommended in [104], fault tolerance in a distributed or parallel learning system is achieved by constantly tracking the input parameters in the server; the entire system returns to the most recent checkpoint following an error, and such systems do not perform checkpoints at every stage, even in the presence of a high number of calls and network activity. Fault forecasting is well known in fault identification and handling, as stated in [105]; quick error detection can prevent more serious system failures. This operation comprises numerous processes, and some of the most recent research investigations include quantitative model-based, qualitative model-based, and history-based approaches. Apart from reinforcement learning, unsupervised learning is an additional technique for recognizing patterns in data without a predefined output [106]. Such techniques do not allow the outcome to be estimated, since unsupervised learning lacks an output target; instead, the algorithms rely on their own structure to extract as much detail as they can from the data. Deep learning techniques were proposed in [107] as a rapid way to identify multicriteria errors in complex industrial analyses. Fault tolerance can benefit from the application of such AI-related techniques.
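
To make the proactive/resilient use of ML concrete, the sketch below trains a simple classifier on synthetic historical VM metrics and flags a VM whose predicted failure probability exceeds a threshold; the features, data, and 0.7 threshold are illustrative assumptions rather than any of the cited methods.

```python
# A minimal sketch (synthetic data, illustrative features) of ML-based failure
# prediction: learn from historical VM metrics whether a failure is likely, and
# act (e.g., migrate or checkpoint) before it happens.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Historical samples: [cpu_utilisation, memory_utilisation, io_error_rate]
healthy = rng.normal([0.4, 0.5, 0.01], 0.05, size=(200, 3))
failing = rng.normal([0.9, 0.9, 0.20], 0.05, size=(200, 3))
X = np.vstack([healthy, failing])
y = np.array([0] * 200 + [1] * 200)          # 1 = VM failed shortly afterwards

model = LogisticRegression().fit(X, y)

# Live reading from a VM that looks overloaded.
current = np.array([[0.88, 0.92, 0.15]])
p_fail = model.predict_proba(current)[0, 1]
if p_fail > 0.7:                              # illustrative threshold
    print(f"Predicted failure probability {p_fail:.2f}: migrate/checkpoint the VM")
else:
    print(f"Predicted failure probability {p_fail:.2f}: keep monitoring")
```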

Fault Induction: In this resilient technique, failures are managed by making assumptions based on the reaction of the system. Moving forward with this technique, [108] proposed that a hybrid energy system be used in practice to apply a multi-source power administration technique; the analysis shows how to improve fault tolerance, scalability, efficiency, and dependability. The concepts proposed in [99] are being used by some of the most well-known firms in the world, including Google and Amazon, to increase their fault tolerance. Here the authors employed software named GameDay, which is intended to highlight significant shortcomings in methods for finding flaws and dependencies between different components of a system. In a GameDay scenario, team members from every level of the business must collaborate to find a solution; if everything goes perfectly in the repeatable tests, the GameDay activity is considered successful. Similarly, the authors in [109] employed game theory and showed that such a smart-grid operator can swiftly supply electricity through a distributed system. Additionally, several classifiers have been compared in [110] on metrics such as accuracy and fault prediction.
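
The toy drill below conveys the spirit of such fault-induction exercises: replicas are deliberately taken offline and the system is checked for continued service. It is an illustration only, not the GameDay tooling itself.

```python
import random

class ReplicatedService:
    """A toy replicated service used for a fault-induction drill."""

    def __init__(self, n_replicas: int):
        self.up = {f"replica-{i}": True for i in range(n_replicas)}

    def inject_fault(self) -> str:
        # Deliberately take one healthy replica offline.
        victim = random.choice([r for r, ok in self.up.items() if ok])
        self.up[victim] = False
        return victim

    def handle_request(self):
        # The service survives as long as at least one replica is still up.
        survivors = [r for r, ok in self.up.items() if ok]
        return survivors[0] if survivors else None

service = ReplicatedService(n_replicas=3)
for _ in range(2):                                  # induce two faults in a row
    print("Injected fault into", service.inject_fault())
server = service.handle_request()
if server:
    print("Drill passed, request served by", server)
else:
    print("Drill failed: outage")
```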

2.2 Load balancing with fault tolerance

Load balancing with fault tolerance is a significant challenge in cloud computing. Efficient load balancing techniques must also include fault tolerance capacity: the system should distribute the load uniformly across all available nodes while simultaneously detecting and removing faults to maximize the performance and efficiency of the cloud environment. Various algorithms are surveyed and presented in Table 9. The authors of [111] introduced Honeybee Inspired-Load Balancing (HBI-LB), a reliable, nature-inspired, fault-tolerant load-balancing approach. In the suggested method, the assigned tasks ranged from 100 to 500 in number and from 2000 to 10000 in length, and 10 fog centers and 15 fog nodes were utilized. Assigned tasks pass information about the status of and load on the resources to other in-progress tasks, in the same way that honeybees inform their fellows about their position. Besides, in [112], the Proactive and Reactive Fault Tolerance Framework (PFTF) was proposed with an Elastic Cloud Balancer (ECB). It avoids the situation in which some cloud nodes are idle or minimally loaded while others are overloaded. The proposed ECB enhances scheduling quality in combination with Job Shop Scheduling by considering and optimizing QoS parameters; the model was evaluated with 9 to 13 tasks of sizes ranging from 1000 MB to 8000 MB. Additionally, due to the dynamic nature of cloud infrastructure, real-time features such as availability and reliability need to be achieved. In this line, Proactive Load Balancing Fault Tolerance (PLBFT) was proposed in [113] as an efficient fault-tolerant load-balancing model evaluated on a private cloud platform. This model relies on migrating the faulty VM directly to another destination host; the load on the destination VM is managed first (in case the destination VM is overloaded) before the defective VM is migrated there (a sketch of this idea follows this paragraph), and the approach shows high reliability compared to other similar techniques. Load balancing and fault tolerance techniques are designed to provide highly reliable and available services, and to further improve the availability of cloud services, a combination of load-balancing and fault-tolerant techniques has been proposed [114]. The proposed model is highly reliable in case of task failure when the number of tasks is between 13 and 18, the task execution time is between 1 and 9, and the task priority is between 1 and 3, with four VMs. Moreover, in [115], Deadline Pre-emptive Scheduling (DBPS) was proposed based on cloud partitioning, where fault tolerance is achieved by Throttled Load Balancing for Cloud (TLBC); the model was tested on a workload of 10 to 300 tasks without specifying the number of VMs. Furthermore, a machine-learning-based approach, Fault-tolerance Load Balancing (FTLB), was proposed in [99]; it embeds fault tolerance in load balancing while optimizing other QoS parameters, and the evaluation was performed using 100 computing cycles on three VMs. Finally, an Integrated Virtualized Failover Strategy (IVFS), similar to AFTRC, was proposed in [116]. It employs replication and checkpoint-restart: a Cloud Load Balancer (CLB) was added to AFTRC, and checkpointing was carried out by implementing the Reward Renewal Process (RRP) [117]. Once the load is received, it is transferred to the CLB by the Cloud Controller (CC); the main job of the CLB is to replicate the load on a suitable VM, based on the load information, in case of failure.
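
The PLBFT idea of managing the destination's load before migrating a faulty VM can be sketched as follows; the host names, capacities, loads, and the 0.8 overload threshold are made-up values, not the evaluated model.

```python
def migrate_faulty_vm(faulty_load, hosts, overload_threshold=0.8):
    """Pick the least-loaded host; if adding the faulty VM would overload it,
    shed enough load first (where the shed load goes is omitted in this toy)."""
    dest = min(hosts, key=lambda h: h["load"] / h["capacity"])
    projected = (dest["load"] + faulty_load) / dest["capacity"]
    if projected > overload_threshold:
        excess = (projected - overload_threshold) * dest["capacity"]
        dest["load"] -= excess
        print(f"Shed {excess:.1f} load units from {dest['name']} before migration")
    dest["load"] += faulty_load
    return dest["name"]

hosts = [
    {"name": "host-A", "capacity": 100.0, "load": 70.0},
    {"name": "host-B", "capacity": 100.0, "load": 50.0},
]
print("Faulty VM migrated to", migrate_faulty_vm(faulty_load=40.0, hosts=hosts))
```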

Table 9 Comparative analysis of different proposed fault tolerance and load balancing algorithms

The comparative analysis of different fault-tolerance-based load-balancing algorithms is presented in Table 10. These algorithms were proposed to distribute the workload across the nodes regardless of faults, i.e., they have the capacity to handle faults.

Table 10 Comparative analysis of fault-tolerant-based load-balancing algorithms

3 Discussions and observations

The presented survey summarizes the focus of researchers on distinct hybrid fault-tolerance-related frameworks. The main emergent and developing methods of fault tolerance in a cloud environment fall into three categories: Reactive, Proactive, and Resilient Methods. The survey was conducted on two main hybrid fault-tolerant categories, i.e., scheduling with fault tolerance and load balancing with fault tolerance. Several observations gathered during the survey are listed below.

3.1 Statistics of hybrid survey of scheduling and fault tolerance algorithms

While dealing with hybrid frameworks of scheduling and fault tolerance, researchers have focused on all three fault tolerance approaches, i.e., Reactive, Proactive, and Resilient. However, it is observed that more emphasis has been placed on Proactive approaches and less on Resilient ones. The related statistics of these approaches are depicted in Fig. 9.

Fig. 9 Showing different fault-tolerance approaches targeted by researchers

Moreover, different techniques such as Replication, Migration, and Rejuvenation have also been employed while dealing with this hybrid framework. Replication techniques are mainly used for reactive approaches. On the other hand, Migration and Rejuvenation techniques are utilized for proactive approaches. It is also observed from the literature that replication and migration techniques were more frequently used to address the faults in the cloud. Moreover, self-healing and checkpoint restart techniques are used by the SHelp framework. The statistics of different approaches employed for Reactive, Proactive, and Resilient strategies in this hybrid framework are depicted in Fig. 10.

Fig. 10 Showing category-wise percentage of different techniques used in different fault-tolerant approaches

It is also noticed from the presented survey that different types of faults have been handled by using hybrid fault-tolerant scheduling and load-balancing frameworks. Moreover, it was observed that software faults, hardware faults, parametric faults, and crashes were resolved using a proactive approach. The reactive approach addressed configuration faults, parametric faults, byzantine faults, participant faults, and host failures. Likewise, resilient approaches are utilized to manage general faults. Additionally, the overall statistics of different faults handled by considered hybrid frameworks are depicted in Fig. 11.

Fig. 11 Showing the percentage of different faults handled in the surveyed scheduling and fault tolerance frameworks

The statistics of the fault models considered in the surveyed articles show that researchers are more motivated towards software faults, while transient, intermittent, and permanent faults have received less attention. For several strong reasons, addressing these kinds of faults is essential in distributed systems and applications. First, proactive steps to guarantee system resilience are required due to the unpredictable nature of transient faults, which are brief interruptions in system performance; to reduce downtime and provide a consistent user experience, organizations must recognize and address transient issues. Second, intermittent failures, defined by irregular disruptions that might happen at any time, pose a major threat to system reliability; they must be managed effectively to avoid cascading failures and to guarantee the stability of necessary executions, thereby preserving the system's overall integrity. Furthermore, the seriousness of permanent faults cannot be overstated: these enduring problems may cause the system to deteriorate over time, impacting system operation and SLAs. Resolving permanent faults is therefore essential for maintaining the system's lifespan and functionality, while ignoring them might cause irreversible harm and compromise the overall sustainability of the system. Finally, the maintenance of system continuity, robustness, and reliability is the primary reason for managing the discussed hardware failures. In the end, proactive fault management techniques contribute to uninterrupted system and application performance during unexpected obstacles by protecting the integrity of crucial operations, improving SLAs, and thereby enhancing user experience and satisfaction.
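
As an illustration (not drawn from the surveyed works), the sketch below pairs each of these fault classes with a typical handling policy: retry for transient faults, monitoring and isolation for intermittent ones, and replacement for permanent ones.

```python
def handle_fault(fault_type: str) -> str:
    """Map a fault class to an illustrative handling policy."""
    policy = {
        "transient": "retry the operation after a short back-off",
        "intermittent": "flag the resource, increase monitoring, reroute if it recurs",
        "permanent": "take the resource out of service and migrate its workload",
    }
    return policy.get(fault_type, "escalate to the operator")

for fault in ("transient", "intermittent", "permanent", "unknown"):
    print(f"{fault:12s} -> {handle_fault(fault)}")
```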

3.2 Statistics of hybrid survey of load balancing and fault tolerance algorithms

It is also perceived in this survey that researchers have focused on the optimization of various parameters simultaneously along with fault tolerance. Response time was considered and optimized more frequently than other QoS parameters, while the least consideration was given to task waiting time and computational cost. Based on this survey, the statistics of the various optimized parameters are presented in Fig. 12. Besides, the considered frameworks include both dynamic and static environments, and researchers are more motivated toward dynamic than static algorithms. Figure 13 depicts the statistics of the surveyed models that support dynamism.

Fig. 12 Showing the percentage of optimized parameters in surveyed load balancing and fault tolerance frameworks

Fig. 13 Showing the percentage of dynamism in surveyed hybrid load balancing and fault tolerance frameworks

An analysis was also carried out of the parameter optimization required for a reliable cloud. Figure 14 presents the degree of optimization of metrics for scheduling with fault tolerance, scheduling with load balancing, fault tolerance, load balancing, and scheduling. Additionally, a parameter optimization analysis of the various fault-tolerant approaches from the literature was conducted and is presented in Fig. 15.

Fig. 14 Showing the analyses of parameter optimizations for different cloud reliability measures

Fig. 15 Showing the percentage of parameter optimizations for different fault-tolerant approaches

Finally, observations regarding the platform or environment used for simulation in the surveyed works are presented statistically in Fig. 16.

Fig. 16 Showing the percentage of tools used for simulation by the researchers

4 Forthcoming research directions and open issues

It can be seen from the reviewed state-of-the-art that important QoS parameters other than response time are not being focused on. Other parameters, such as makespan, turnaround time, waiting time, flowtime, resource utilization, and accuracy, also need to be considered. Furthermore, various other faults, such as byzantine faults and system crashes, have not been examined much in hybrid fault tolerance algorithms. Therefore, it is necessary to enhance the performance of these hybrid fault tolerance algorithms by addressing these limitations in forthcoming research. Moreover, researchers should focus on some of the aspects mentioned below to overcome the limitations of existing techniques [138, 139].

  • Focus more on resilient fault tolerance.

  • Focus on the computational cost along with fault tolerance.

  • Identify and predict the faults accurately.

  • Resolve faults with load balancing and scheduling.

  • Fault handling with optimization of other QoS parameters.

4.1 Future works

After careful consideration and assessment, it is concluded that several research directions could be pursued to raise the performance of cloud computing and boost the optimization of the QoS parameters of cloud systems. They are listed below:

  1. The researchers can make scheduling more efficient for better makespan and average resource utilization.

  2. The assessed state-of-the-art shows that, except for response time, certain crucial QoS criteria are not being prioritized. Additional factors, including turnaround time, waiting time, flow time, resource utilization, and accuracy, also need to be taken into account.

  3. To improve task execution time and scheduling, a large body of research focuses on discovering resource and workload identification criteria. For workloads to be adaptive, scalable, and optimal, under- and over-utilization of resources should be avoided.

  4. A sender-initiated load balancing mechanism that assists in uniform load distribution among dispersed nodes is necessary for task relocation.

  5. Reservation can be used for fault tolerance, as suggested in [87], to ensure complete execution of tasks: resources are reserved well in advance and may be used in case of faults.

  6. It is essential to concentrate on penalty limiting while taking system failures into account in order to attain QoS-optimisation-based allocation.

  7. Only a few scheduling methods include the availability feature, which depends heavily on VM failures and on changes in the user impact rate; therefore, to decrease VM failures, it is important to take this parameter into further consideration in later algorithms.

  8. The penalties arising from faults can be minimized by accompanying the models with efficient load-balancing techniques.

  9. It is clear from examining several methods that a task scheduling algorithm by itself cannot address all the issues. Most algorithms base their scheduling on a few factors; one method, for instance, only considers the response time and execution time parameters and overlooks other QoS principles such as execution cost, dependability, and utilization. Therefore, by including more criteria, an improved scheduling algorithm that can produce better results may be developed.

  10. Future studies should consider scalability, elasticity, and other fault-handling overheads, which are the properties that allow a system to fit a given situation.

4.2 Methodical roadmap for open challenges

A structured strategy or roadmap, presented in Fig. 17, that incorporates prioritization based on impact and feasibility is needed to address the challenges of scheduling and load balancing with fault tolerance.

Fig. 17 Showing the proposed structured roadmap to address the cloud challenges

5 Conclusion

In this study, diverse models for analyzing faults and rectifying them by implementing fault tolerance integrated with scheduling and load-balancing strategies in cloud environments are comprehensively surveyed. The main emergent and developing methods of fault tolerance in the cloud environment are categorized into Proactive, Reactive, and Resilient. Resilient approaches, which draw on the revolutionary AI/ML technologies, are observed to be more efficient than proactive and reactive techniques. This is because reactive and proactive techniques normally employ traditional procedures such as checkpoint-restart, replication, and migration, which have limitations: they may find it difficult to adjust dynamically or to accommodate shifting demands, which can result in inefficiencies during times of high consumption.

After reviewing the literature, the below-mentioned conclusions can be drawn:

  • Checkpointing/restarting and replication were found to be the most frequently used methods to address faults in the cloud.

  • Scholars and researchers are more concerned with detecting crash faults than hardware faults such as transient, intermittent, or permanent faults.

  • When it comes to the implementation tool for evaluating the presented algorithms, researchers mostly use the CloudSim tool.

  • Proactive approaches have been used more frequently than reactive and resilient ones.

  • Researchers are more motivated toward response time and less towards makespan, adaptability, accuracy, and crashes.

  • Since resilient approaches utilize machine learning and artificial intelligence to predict and handle faults, they represent the forthcoming direction of fault tolerance in the cloud.