Abstract:
Distributed systems can be homogeneous (clusters) or heterogeneous (Grid, Cloud, and P2P systems). Several problems arise in these systems, including quality of service (QoS), resource selection, load balancing, and fault tolerance. Fault tolerance is a central concern in the design of distributed systems. When a hardware or software component malfunctions, it causes a failure of the system; we call the underlying cause a fault. For the system to keep functioning even in the presence of such faults, designers need fault-tolerance techniques, whose goal is to detect and correct the resulting errors. In this paper, we first give an overview of the basic concepts of distributed systems and their failure types; we then present in detail the different fault-tolerance techniques used to identify and correct faults in different kinds of systems, such as clusters, grid computing, Cloud, and P2P systems.
Date of Conference: 24-25 October 2018
Date Added to IEEE Xplore: 03 January 2019