
Optimal fault-tolerant computing on multiprocessor systems

Acta Informatica

Abstract.

Suppose \(m \ge 2\) identical processors, each subject to random failures, are available for running a single job of given duration \(\tau\). The failure law is operative only while a processor is active. To guard against the loss of accrued work due to a failure, checkpoints can be made, each requiring time \(\delta\); a successful checkpoint saves the state of the computation, but failures can also occur during checkpoints. The problem is to determine how best to schedule checkpoints if the goal is to maximize the probability that the job finishes before all \(m\) processors fail. We solve this problem first for \(m=2\) and an exponential failure law. For given \(\tau\) and \(\delta\) we show how to determine an integer \(k \ge 0\) and time intervals \(I_1, \ldots, I_{k+1}\) such that an optimal procedure is to run the job on one processor, checkpointing at the end of each interval \(I_j\), \(j = 1, \ldots, k\), until either the job is done or a failure occurs. In the latter case, the remaining processor resumes the job starting in the state saved by the last successful checkpoint; the job then runs until it completes or until the second processor also fails. We give an explicit formula for the maximum achievable probability of completing the job for any fixed \(k \ge 0\). An explicit result for \(k_{opt}\), the optimum value of \(k\), seems out of reach; however, we give upper and lower bounds on \(k_{opt}\) that are remarkably tight; they show that only a few values of \(k\) need to be tested in order to find \(k_{opt}\). With the failure rate normalized to 1, we also derive the asymptotic estimate \(k_{opt} - \sqrt{2\tau/\delta} = O(1)\) as \(\delta \to 0\), and calculate conditional expected job completion times. For the more difficult problem with \(m \ge 3\) processors, we formulate a computational approach based on a discretized model in which the failure law is the analogous geometric distribution. By proving a unimodality property of the optimal completion probability, we are able to describe a computation of this optimum that requires \(O(m n \log n)\) time, where \(n\) is the job running time. Several examples bring out behavioral details.
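The two-processor procedure described in the abstract can be illustrated numerically. The following Python sketch is not the paper's explicit formula; it evaluates the completion probability of a given checkpoint schedule directly from the model as stated in the abstract, under assumptions that are ours: the failure rate is normalized to 1, failures are detected immediately, restoring from a checkpoint costs no time, and the second processor makes no further checkpoints. The function names (`completion_probability`, `equal_schedule`, `best_k_equal_spacing`) are hypothetical, and equal-length intervals are used only as a stand-in schedule, not as the optimal intervals derived in the paper.

```python
import math


def completion_probability(tau, delta, intervals):
    """Probability that the job completes with m = 2 processors.

    Assumptions (ours, for illustration): exponential failures with rate 1
    while a processor is active, instant failure detection, zero restore
    cost, and no checkpoints on the second processor.

    intervals: lengths of I_1, ..., I_{k+1}; they must sum to tau.
    A checkpoint of length delta follows each of the first k intervals.
    """
    if abs(sum(intervals) - tau) > 1e-9:
        raise ValueError("interval lengths must sum to tau")
    k = len(intervals) - 1

    # Case 1: the first processor survives the whole run (work + k checkpoints).
    prob = math.exp(-(tau + k * delta))

    saved = 0.0  # work protected by the last successful checkpoint
    start = 0.0  # active time of the first processor at the segment start
    for j, length in enumerate(intervals):
        # A segment is an interval plus its checkpoint (none after the last).
        end = start + length + (0.0 if j == k else delta)
        # Case 2: the first processor fails somewhere inside this segment ...
        p_fail_here = math.exp(-start) - math.exp(-end)
        # ... and the second processor must redo all unsaved work unharmed.
        prob += p_fail_here * math.exp(-(tau - saved))
        saved += length
        start = end
    return prob


def equal_schedule(tau, k):
    """Hypothetical stand-in schedule: k checkpoints, k + 1 equal intervals."""
    return [tau / (k + 1)] * (k + 1)


def best_k_equal_spacing(tau, delta, k_max):
    """Scan k = 0..k_max and return the best k for equal spacing only."""
    return max(
        range(k_max + 1),
        key=lambda k: completion_probability(tau, delta, equal_schedule(tau, k)),
    )


if __name__ == "__main__":
    tau, delta = 1.0, 0.01
    for k in range(4):
        p = completion_probability(tau, delta, equal_schedule(tau, k))
        print(k, p)
    print("best k (equal spacing):", best_k_equal_spacing(tau, delta, 50))
```

With \(k = 0\) the sketch reduces to \(1 - (1 - e^{-\tau})^2\), the probability that at least one of two independent uncheckpointed runs succeeds, which gives a quick sanity check. Because the scan is restricted to equal intervals, the value of \(k\) it returns need not coincide with the paper's \(k_{opt}\).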

Additional information

Received: 29 September 1995 / 29 January 1997

About this article

Cite this article

Bruno, J., Coffman Jr, E. Optimal fault-tolerant computing on multiprocessor systems. Acta Informatica 34, 881–904 (1997). https://doi.org/10.1007/s002360050110

  • DOI: https://doi.org/10.1007/s002360050110
