
On the completion time distribution for tasks that must restart from the beginning if a failure occurs

Authors:
Robert Sheahan, University of Connecticut, Storrs, CT
Lester Lipsky, University of Connecticut, Storrs, CT
Pierre M. Fiorini, University of Southern Maine, Portland, MA
Søren Asmussen, Aarhus University, Denmark

ACM SIGMETRICS Performance Evaluation Review, Volume 34, Issue 3, December 2006, pp. 24–26. https://doi.org/10.1145/1215956.1215967

Published: 01 December 2006


Abstract

For many systems, failure is so common that the design choice of how to deal with it may have a significant impact on system performance. There are many specific and distinct failure recovery schemes, but they can be grouped into three broad classes: RESUME, also referred to as preemptive resume (prs) or checkpointing; REPLACE, also referred to as preemptive repeat different (prd); and RESTART, also referred to as preemptive repeat identical (pri). The three recovery schemes work as follows: (1) RESUME: when a task fails, it knows exactly where it stopped and can continue from that point when allowed to resume; (2) REPLACE: when a task fails, processing begins again with a brand-new task sampled from the same task time distribution; and (3) RESTART: when a task fails, it loses everything it had acquired up to that point and must start anew when processing continues. RESTART is distinctly different from REPLACE, since the restarted task must run at least as long as it did before it failed, whereas a new sample, selected at random, might run for a shorter or longer time.
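The difference between the three schemes can be illustrated with a small simulation. The sketch below is not taken from the article: it assumes failures arrive as a Poisson process with rate `fail_rate`, ignores repair time, and uses hypothetical names (`completion_time`, `sample_task`) chosen only for illustration.

```python
import random

SCHEMES = ("RESUME", "REPLACE", "RESTART")

def completion_time(scheme, sample_task, fail_rate, rng):
    """Processing time until one task completes under a recovery scheme.

    Assumptions (ours, not the article's): exponentially distributed time
    between failures with rate `fail_rate`, and zero repair time.
    """
    task = sample_task(rng)   # work required by the original task
    remaining = task          # work still to do in the current attempt
    elapsed = 0.0             # processing time consumed so far
    while True:
        uptime = rng.expovariate(fail_rate)  # time until the next failure
        if uptime >= remaining:
            return elapsed + remaining       # attempt finishes before failing
        elapsed += uptime
        if scheme == "RESUME":
            remaining -= uptime              # keep the work already done
        elif scheme == "REPLACE":
            remaining = sample_task(rng)     # draw a brand-new task time
        elif scheme == "RESTART":
            remaining = task                 # same task, from the beginning
        else:
            raise ValueError(f"unknown scheme: {scheme}")

if __name__ == "__main__":
    rng = random.Random(0)
    exp_task = lambda r: r.expovariate(1.0)  # illustrative task time distribution
    for scheme in SCHEMES:
        runs = [completion_time(scheme, exp_task, 0.25, rng) for _ in range(10_000)]
        print(scheme, sum(runs) / len(runs))
```

Under these assumptions, RESUME always returns the task time itself (no work is lost), while RESTART can never finish sooner than the original task time, which is the property that separates it from REPLACE.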


Published in

ACM SIGMETRICS Performance Evaluation Review Volume 34, Issue 3
December 2006
62 pages
ISSN:0163-5999
DOI:10.1145/1215956

Copyright © 2006 Authors
Publisher
Association for Computing Machinery
New York, NY, United States