Abstract:
This paper presents a novel approach to detecting resources in distributed systems whose rate of intermittent faults exceeds the level of unavoidable transient faults caused by environmental phenomena. Intermittent faults occur in stressed resources and are often a precursor of permanent faults. The proposed early fault detection and diagnosis makes it possible to take precautionary measures before a component of the distributed system fails permanently. We present four methods that detect intermittent faults implicitly by taking the distributed applications and their dependencies into account. Explicit tests, which would incur additional costs and resource load, are therefore not required. Moreover, the implicit approach may considerably reduce the number of plausibility tests compared to the conservative solution with one test per resource. We analyzed and evaluated implementations of the proposed fault detection principle. The experimental results give evidence of the feasibility of our approach and compare the implemented methods in terms of runtime and detection rate.
Date of Conference: 20-23 January 2014
Date Added to IEEE Xplore: 20 February 2014
Electronic ISBN: 978-1-4799-2816-3