Optimal fault-tolerant computing on multiprocessor systems

Bruno, John; Coffman Jr, E.G.

doi:10.1007/s002360050110

Published: November 1997

Acta Informatica, Volume 34, pages 881–904 (1997)

Abstract.

Suppose \(m \ge 2\) identical processors, each subject to random failures, are available for running a single job of given duration \(\tau\). The failure law is operative only while a processor is active. To guard against the loss of accrued work due to a failure, checkpoints can be made, each requiring time \(\delta\); a successful checkpoint saves the state of the computation, but failures can also occur during checkpoints. The problem is to determine how best to schedule checkpoints if the goal is to maximize the probability that the job finishes before all \(m\) processors fail. We solve this problem first for \(m=2\) and an exponential failure law. For given \(\tau\) and \(\delta\) we show how to determine an integer \(k \ge 0\) and time intervals \(I_1, \ldots, I_{k+1}\) such that an optimal procedure is to run the job on one processor, checkpointing at the end of each interval \(I_j\), \(j = 1, \ldots, k\), until either the job is done or a failure occurs. In the latter case, the remaining processor resumes the job starting in the state saved by the last successful checkpoint; the job then runs until it completes or until the second processor also fails. We give an explicit formula for the maximum achievable probability of completing the job for any fixed \(k \ge 0\). An explicit result for \(k_{\mathrm{opt}}\), the optimum value of \(k\), seems out of reach; however, we give upper and lower bounds on \(k_{\mathrm{opt}}\) that are remarkably tight; they show that only a few values of \(k\) need to be tested in order to find \(k_{\mathrm{opt}}\). With the failure rate normalized to 1, we also derive the asymptotic estimate \(k_{\mathrm{opt}} - \sqrt{2\tau/\delta} = O(1)\) as \(\delta \to 0\), and calculate conditional expected job completion times. For the more difficult problem with \(m \ge 3\) processors, we formulate a computational approach based on a discretized model in which the failure law is the analogous geometric distribution. By proving a unimodality property of the optimal completion probability, we are able to describe a computation of this optimum that requires \(O(m n \log n)\) time, where \(n\) is the job running time. Several examples bring out behavioral details.
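
The two-processor procedure described in the abstract can be illustrated with a small numerical sketch. The code below is not the paper's explicit formula, which the abstract does not reproduce; it is a direct evaluation of the completion probability under one reading of the model: the first processor runs each interval followed by a checkpoint of length \(\delta\) (none after the last interval), and after a failure the second processor restarts from the last saved state and runs the remaining work without checkpointing. The function names `completion_probability` and `best_equal_split` are hypothetical, and the equal-length intervals used in the scan are an assumption made for illustration, not the optimal intervals \(I_1, \ldots, I_{k+1}\) of the paper.

```python
import math


def completion_probability(intervals, delta, rate=1.0):
    """Probability that the job finishes before both processors fail (m = 2).

    intervals: lengths of I_1, ..., I_{k+1}; their sum is the job duration tau.
    delta:     time per checkpoint; one follows each of the first k intervals.
    rate:      exponential failure rate (the abstract normalizes it to 1).
    """
    k = len(intervals) - 1
    remaining = sum(intervals)  # work not yet saved by a successful checkpoint
    elapsed = 0.0               # active time accumulated on the first processor
    prob = 0.0
    for j, length in enumerate(intervals):
        # Segment j on processor 1: the interval plus its checkpoint, if any.
        segment = length + (delta if j < k else 0.0)
        p_fail_here = (math.exp(-rate * elapsed)
                       - math.exp(-rate * (elapsed + segment)))
        # Processor 2 resumes from the last saved state and must survive
        # all of the remaining work (assumed: it takes no checkpoints).
        prob += p_fail_here * math.exp(-rate * remaining)
        elapsed += segment
        remaining -= length
    # Processor 1 survives the entire run, checkpoints included.
    prob += math.exp(-rate * elapsed)
    return prob


def best_equal_split(tau, delta, k_max=50):
    """Scan k = 0..k_max with equal-length intervals (illustrative assumption)."""
    def prob(k):
        return completion_probability([tau / (k + 1)] * (k + 1), delta)

    k_best = max(range(k_max + 1), key=prob)
    return k_best, prob(k_best)


if __name__ == "__main__":
    print(best_equal_split(tau=2.0, delta=0.05))
```

With \(k = 0\) the function reduces to \(e^{-\tau}(2 - e^{-\tau})\), the probability that at least one of two independent uncheckpointed runs succeeds. Scanning \(k\) shows the trade-off the abstract describes: each checkpoint shortens the work the second processor must redo but lengthens the time the first processor is exposed to failure.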

Author information

Authors and Affiliations

Computer Science Department, University of California, Santa Barbara, CA 93106, USA (e-mail: bruno@cs.ucsb.edu)
John Bruno
Bell Labs, Lucent Technologies, Murray Hill, NJ 07974, USA (e-mail: egc@research.bell-labs.com)
E.G. Coffman Jr

Additional information

Received: 29 September 1995 / 29 January 1997

About this article

Cite this article

Bruno, J., Coffman Jr, E. Optimal fault-tolerant computing on multiprocessor systems. Acta Informatica 34, 881–904 (1997). https://doi.org/10.1007/s002360050110

Issue Date: November 1997
DOI: https://doi.org/10.1007/s002360050110
