Abstract
For many systems, failure is so common that the choice of how to deal with it can have a significant impact on system performance. There are many specific and distinct failure recovery schemes, but they can be grouped into three broad classes: RESUME, also referred to as preemptive resume (prs) or checkpointing; REPLACE, also referred to as preemptive repeat different (prd); and RESTART, also referred to as preemptive repeat identical (pri). The three recovery schemes work as follows: (1) RESUME: when a task fails, it knows exactly where it stopped and can continue from that point when allowed to resume; (2) REPLACE: when a task fails, then on resuming processing it starts a brand new task sampled from the same task time distribution; and (3) RESTART: when a task fails, it loses all the progress it had made up to that point and must start anew when it continues later. RESTART is distinctly different from REPLACE, since the restarted task must run at least as long as it did before it failed, whereas a new sample, selected at random, might run for a shorter or longer time.
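The difference between the three schemes can be illustrated with a small simulation. The following is a minimal sketch, not the authors' model: it assumes failures arrive with exponentially distributed times to failure and that recovery itself takes no time. The names `completion_time`, `sample_task_time`, and `failure_rate` are hypothetical and chosen only for illustration.

```python
import random

SCHEMES = ("RESUME", "REPLACE", "RESTART")

def completion_time(scheme, sample_task_time, failure_rate, rng):
    """Simulate the total processing time of one task under a recovery scheme.

    Assumptions (not from the source): times to failure are exponential
    with rate `failure_rate` (> 0), and recovery is instantaneous.
    """
    if scheme not in SCHEMES:
        raise ValueError(f"unknown scheme: {scheme}")
    task_time = sample_task_time(rng)   # work required by the task
    remaining = task_time               # work still to do in this attempt
    elapsed = 0.0                       # processing time spent so far
    while True:
        time_to_failure = rng.expovariate(failure_rate)
        if time_to_failure >= remaining:
            # The attempt finishes before the next failure.
            return elapsed + remaining
        elapsed += time_to_failure
        if scheme == "RESUME":
            # Progress is kept: continue from the point of failure.
            remaining -= time_to_failure
        elif scheme == "REPLACE":
            # Start a brand new task drawn from the same distribution.
            task_time = sample_task_time(rng)
            remaining = task_time
        else:  # RESTART
            # Progress is lost: repeat the identical task from the start.
            remaining = task_time
```

Calling, for example, `completion_time("RESTART", lambda r: r.expovariate(1.0), 0.5, random.Random(1))` returns one simulated completion time. Averaging many such samples per scheme gives a rough comparison. Under these assumptions RESUME always spends exactly the task's own work time, while RESTART must eventually see a failure-free interval at least as long as the original task.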