Numerical computation algorithms for sequential checkpoint placement

doi:10.1016/j.peva.2008.11.003

Performance Evaluation

Volume 66, Issue 6, June 2009, Pages 311-326

https://doi.org/10.1016/j.peva.2008.11.003

Abstract

This paper concerns sequential checkpoint placement problems under two dependability measures: steady-state system availability and expected reward per unit time in the steady state. We develop numerical computation algorithms to determine the optimal checkpoint sequence, based on Brender’s classical fixed-point algorithm, and further give three simple approximation methods. Numerical examples with the Weibull failure time distribution illustrate quantitatively the overestimation and underestimation exhibited by the sub-optimal checkpoint sequences based on the approximation methods.

Introduction

System failures in large-scale computer systems can lead to huge economic or critical social losses. Checkpointing and rollback recovery is a commonly used solution for improving the dependability of file systems, and is regarded as a low-cost environment diversity technique from the standpoint of fault-tolerant computing. In particular, when the file system that writes and/or reads data is designed with preventive maintenance in mind, checkpoint generation can back up the significant data on the primary medium to a safe secondary medium, occasionally or periodically, and can play a significant role in limiting the amount of data processing needed for recovery actions after system failures occur. If checkpoints are taken frequently, a larger overhead is incurred by checkpointing itself. Conversely, if checkpoints are seldom placed, a larger recovery overhead is required after a system failure. Hence, it is important to determine the optimal checkpoint sequence taking account of the trade-off between these two kinds of overhead. Since the system failure phenomenon under uncertainty is described by a probability distribution, called the system failure time distribution, the optimal checkpoint sequence should be determined based on a stochastic model [1], [2], [3], [4], [5].

Young [6] obtained an approximate optimal checkpoint interval for computation restart after system failures. Baccelli [7], Chandy et al. [2], Dohi et al. [8], Gelenbe and Derochette [9], Gelenbe [10], Gelenbe and Hernandez [11], Goes and Sumita [12], Grassi et al. [13], Kulkarni et al. [14], Nicola and Van Spanje [15], Sumita et al. [16] proposed performance evaluation models for database recovery, and calculated the optimal checkpoint intervals which maximize the system availability or minimize the mean overhead during normal operation. L’Ecuyer and Malenfant [17] formulated a dynamic checkpoint placement problem as a Markov decision process. Ziv and Bruck [18] reconsidered a checkpoint placement problem under a random environment, by taking account of changes in the operating circumstances. Vaidya [19] examined the impact of checkpoint latency on the overhead ratio for a simple checkpoint model. Recently, Okamura et al. [20] reformulated Vaidya’s model [19] as a semi-Markov decision process.

On the other hand, some authors discussed the sequential checkpoint placement problems where the checkpoint intervals were not always constant. For instance, in almost all checkpoint models for transaction-based systems [7], [8], [9], [10], [11], [12], [16], it could be proved theoretically that the constant checkpoint intervals maximizing the system availability were better than independent and identically distributed random checkpoint intervals. In any case, however, the sequential policy with aperiodic checkpoint intervals provides a general framework for checkpoint placement, because the sequential checkpoint policy includes the periodic one as a special case. Duda [21] derived a recursive formula satisfied by the optimal aperiodic checkpoint sequence minimizing the mean program execution time. Toueg and Babaoğlu [22] developed a discrete dynamic programming algorithm which minimizes the expected execution time of tasks by placing checkpoints between two consecutive tasks under very general assumptions. Kaio and Osaki [23] and Ling et al. [24] proposed approximate methods to calculate the optimal checkpoint sequence minimizing the expected cumulative operation cost until the system failure.

In the sequential checkpoint placement problem, it is assumed that the system failure time obeys a general probability distribution, i.e., not necessarily the negative exponential distribution. In fact, a non-exponential system failure time distribution with increasing failure rate can be assumed for some real workstation failure data [25]. Also, it is reported that some system failures are caused by software aging, such as resource exhaustion, and that the system failure time can then no longer be regarded as an exponentially distributed random variable (see e.g. [26]). The sequential checkpoint placement problem is formulated as a complex non-linear optimization problem with an unknown number of decision variables. This makes it computationally difficult to place the aperiodic checkpoint sequence optimally, even for a simple centralized system, under a given optimality criterion.

Although the checkpointing models mentioned above mainly focused on centralized systems, the analytical techniques for reliability and performance evaluation can also be applied to distributed systems [27]. Wong and Franklin [28] considered simple Markov models to determine the frequency of checkpointing in parallel systems. Plank and Thomason [29] modeled the performance of coordinated checkpointing systems [30], [31], [32] where the number of processors dedicated to the application and the checkpoint intervals are selected by the user before running the program. They employed a birth-and-death Markov chain to determine the system availability of the parallel system over the long term. Agbaria et al. [33] took account of the rollback propagation [34] and evaluated the coordinated checkpointing protocols based on both the overhead ratio and simple Markov chain models. On the other hand, uncoordinated checkpointing techniques [35] are used to reduce the checkpointing overhead in normal processing. Soliman and Elmaghraby [36] developed the so-called hybrid state saving technique to reduce the mean time to execute a finite length task for an uncoordinated checkpointing system. However, an aperiodic checkpoint scheme has not yet been developed in the literature on parallel and distributed systems.

In this paper we consider sequential checkpoint placement problems for centralized systems under two dependability measures: steady-state system availability and expected reward per unit time in the steady state [37], [38]. When the checkpoint strategy is restricted to constant intervals, the past literature [7], [2], [8], [9], [10], [11], [12], [16], [6] provided satisfactory answers on the optimal constant checkpoint interval with the negative exponential or the general system failure time distribution under specific cost criteria. Surprisingly, the general sequential checkpoint placement problems have not been studied sufficiently during the last three decades, except for a few examples [21], [23], [24], [22]. Recently, Ozaki et al. [39] dealt with the same problem under the expected cumulative operation cost over an infinite/finite time horizon, and developed an effective computation algorithm to calculate the optimal checkpoint sequence. However, the algorithm proposed in their paper [39] was not general-purpose and could not be applied to general problems. In this paper, we develop numerical computation algorithms for the optimal sequential checkpoint placement under the steady-state system availability and the expected reward per unit time in the steady state. The basic idea is due to Brender’s classical fixed-point theorem [40]; that is, the computation algorithms proposed here eventually converge to the real optimal solutions.

The rest of this paper is organized as follows. In Section 2, we define the notation and describe two sequential checkpoint placement models with perfect and imperfect checkpointing, referred to as Model A and Model B, respectively. The system availability and the expected reward rate are formulated in Section 3 and Section 4, respectively, where the numerical computation algorithms to maximize them are derived. In Section 5 we introduce three approximation methods to calculate the sub-optimal checkpoint sequence. Numerical examples with the Weibull failure time distribution are given in Section 6 to illustrate the overestimation and underestimation exhibited by the sub-optimal checkpoint sequences based on the approximation methods. We compare the real optimal checkpoint sequence and its associated dependability measures with the three approximate solutions. Finally, the paper is concluded with some remarks in Section 7. The computation algorithms in this paper provide exact solutions to problems that have remained unsolved during the last three decades, and their impact on actual fault-tolerant file management is expected to be significant, because the underlying techniques may be applied to aperiodic and distributed checkpointing protocols.

Section snippets

Model A

Consider a simple file system with sequential checkpointing over an infinite time horizon. The system operation starts at time $t_{0} = 0$, and checkpoints (CPs) are sequentially placed at times $\{t_{1}, t_{2}, \dots, t_{k}, \dots\}$. At each CP, $t_{k}$ $(k = 1, 2, \dots)$, all the file data in main memory is saved to a safe secondary medium such as a CD-ROM, where a cost (time overhead) $c_{0}$ $(> 0)$ is incurred for each CP placement. It is assumed at the moment that the system operation stops during the checkpointing and the file system has

Model A

From the previous discussion in Section 2, the steady-state system availability for Model A is given by $A_{A} (\bar{t}) = \frac{E [X \mid \bar{t}]}{T (\bar{t})} = \left\{ c_{0} \left[ 1 + \sum_{k = 1}^{\infty} \bar{F} (t_{k}) \right] + \mu^{- 1} + a_{0} \left[ \sum_{k = 1}^{\infty} \int_{t_{k - 1}}^{t_{k}} x \, d F (x) - \sum_{k = 1}^{\infty} (t_{k} - t_{k - 1}) \bar{F} (t_{k}) \right] + b_{0} \right\}^{- 1} / \mu .$ It should be noted that obtaining the optimal CP sequence maximizing $A_{A} (\bar{t})$ is equivalent to minimizing $T (\bar{t})$ because the numerator of Eq. (18) is constant with respect to $\bar{t}$. Recently, this type of optimal CP placement problem was considered by the same authors [39], where the sequence of
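
As a rough illustration of how the availability expression above might be evaluated for a given CP sequence, the following Python sketch transcribes $T (\bar{t})$ term by term. It rests on assumptions that are not stated in this excerpt: the CP sequence is truncated at a finite length where $\bar{F} (t_{k})$ is negligible, the CP times increase without bound so that $\sum_{k} \int_{t_{k - 1}}^{t_{k}} x \, d F (x)$ equals the mean failure time $1 / \mu$, and all function and argument names are hypothetical.

```python
import math


def exp_survival(t, mu):
    """Survival function of an exponential failure time with rate mu (illustrative choice)."""
    return math.exp(-mu * t)


def availability_model_a(cps, surv, mean_life, c0, a0, b0):
    """Evaluate A_A = E[X] / T for a truncated, increasing CP sequence t_1 < t_2 < ...

    cps       : list of CP times, truncated where surv(t_k) is negligible (assumption)
    surv      : survival function F-bar(t) = 1 - F(t)
    mean_life : E[X] = 1/mu, the mean failure time
    c0, a0, b0: constants appearing in the availability formula
    """
    sum_surv = 0.0   # sum_k F-bar(t_k)
    sum_gap = 0.0    # sum_k (t_k - t_{k-1}) * F-bar(t_k)
    prev = 0.0       # t_0 = 0
    for t in cps:
        s = surv(t)
        sum_surv += s
        sum_gap += (t - prev) * s
        prev = t
    # sum_k int_{t_{k-1}}^{t_k} x dF(x) is replaced by E[X], which assumes the
    # CP times cover the whole half-line [0, infinity).
    total = c0 * (1.0 + sum_surv) + mean_life + a0 * (mean_life - sum_gap) + b0
    return mean_life / total


if __name__ == "__main__":
    mu = 0.01                                   # illustrative failure rate
    cps = [10.0 * k for k in range(1, 5001)]    # periodic CPs every 10 time units
    a = availability_model_a(cps, lambda t: exp_survival(t, mu), 1.0 / mu, 0.5, 1.0, 0.0)
    print(f"availability of the periodic sequence: {a:.6f}")
```

Because maximizing $A_{A} (\bar{t})$ is equivalent to minimizing $T (\bar{t})$, a function of this kind could serve as the objective when comparing candidate sequences; it does not implement the paper's optimization algorithm.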

Performability analysis

Our next concern is the expected reward per unit time in the steady state. Define:

• $a$: reward per unit time in the normal state
• $b$: reward per unit time while a CP is being placed
• $c$: reward per unit recovery time in the system down state
• $P_{1} (\bar{t})$: steady-state probability that the system is in the normal state
• $P_{2} (\bar{t})$: steady-state probability that the system is in the checkpointing state
• $P_{3} (\bar{t})$: steady-state probability that the system is recovering from a system failure.

Then, the expected reward per unit time in

Exponential approximation

For Model A under the availability criterion, it is well known that the optimal CP interval is constant, i.e., $t_{1} = t_{2} - t_{1} = \dots = t_{k + 1} - t_{k} = \dots$, if $F (t)$ is the exponential distribution with mean $1 / \mu$. Under the assumptions that $a_{0} = 1$ and $b_{0} = 0$, Young [6] considered the checkpoint restart model with a constant CP interval and the exponential system failure time distribution, and derived the following non-linear equation, which the optimal CP interval $t_{1}$ satisfies: $e^{c_{0} \mu} - t_{1} \mu - e^{- t_{1} \mu} = 0 .$ Based on the second order
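
The non-linear equation above has no closed-form solution for $t_{1}$, but its left-hand side is strictly decreasing in $t_{1} > 0$ and positive at $t_{1} = 0$, so a simple bisection finds the unique positive root. The Python sketch below is one hedged way to do this; the function names and the parameter values are hypothetical. For comparison it also evaluates $\sqrt{2 c_{0} / \mu}$, which is what one obtains by expanding $e^{c_{0} \mu}$ to first order and $e^{- t_{1} \mu}$ to second order in that equation; this is a derivation made here for illustration, not a statement of what the truncated text goes on to say.

```python
import math


def young_residual(t1, c0, mu):
    """Left-hand side of e^{c0*mu} - t1*mu - e^{-t1*mu} = 0."""
    return math.exp(c0 * mu) - t1 * mu - math.exp(-t1 * mu)


def solve_young_interval(c0, mu, tol=1e-12):
    """Bisection for the unique positive root t1 (requires c0 > 0 and mu > 0)."""
    lo, hi = 0.0, 1.0 / mu
    # The residual is positive at 0 and decreasing, so expand hi until it turns negative.
    while young_residual(hi, c0, mu) > 0.0:
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if young_residual(mid, c0, mu) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def sqrt_approximation(c0, mu):
    """Low-order Taylor approximation sqrt(2*c0/mu) of the root."""
    return math.sqrt(2.0 * c0 / mu)


if __name__ == "__main__":
    c0, mu = 0.01, 0.1   # illustrative values only
    print("root of the non-linear equation:", solve_young_interval(c0, mu))
    print("square-root approximation:      ", sqrt_approximation(c0, mu))
```

The two values agree closely when $c_{0} \mu$ is small and drift apart as the checkpoint overhead becomes comparable to the mean failure time.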

Numerical examples

In this section we calculate numerically the exact and approximate optimal CP sequences, and compare them in terms of dependability measures. Here we refer to the approximations based on an exponential distribution, a constant CP interval with a general distribution, and a constant hazard as Approximation 1, Approximation 2 and Approximation 3, respectively. Suppose that the system failure time obeys the Weibull distribution: $F (t) = 1 - e^{- (t / \eta)^{m}}$ with shape parameter $m$ $(> 0)$ and scale parameter $\eta$ $(> 0)$. We
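
For readers who want to experiment with the Weibull assumption above, the following Python sketch writes out the distribution function together with the standard survival function, hazard rate and mean that follow from it. The function names and the sample parameter values are hypothetical and are not taken from the paper's numerical examples.

```python
import math


def weibull_cdf(t, m, eta):
    """F(t) = 1 - exp(-(t/eta)^m) for t >= 0."""
    return 1.0 - math.exp(-((t / eta) ** m))


def weibull_survival(t, m, eta):
    """F-bar(t) = 1 - F(t)."""
    return math.exp(-((t / eta) ** m))


def weibull_hazard(t, m, eta):
    """Hazard rate f(t)/F-bar(t) = (m/eta) * (t/eta)^(m-1), for t > 0."""
    return (m / eta) * (t / eta) ** (m - 1.0)


def weibull_mean(m, eta):
    """Mean failure time eta * Gamma(1 + 1/m)."""
    return eta * math.gamma(1.0 + 1.0 / m)


if __name__ == "__main__":
    m, eta = 2.0, 5.0   # illustrative shape and scale only
    for t in (1.0, 2.0, 4.0):
        print(t, weibull_cdf(t, m, eta), weibull_hazard(t, m, eta))
    print("mean failure time:", weibull_mean(m, eta))
```

With $m > 1$ the hazard rate increases in $t$, with $m = 1$ it is constant and the distribution reduces to the exponential case, and with $m < 1$ it decreases.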

Conclusion

In this paper, we have developed numerical computation algorithms for sequential checkpoint placement, so as to maximize the steady-state system availability and the expected reward per unit time in the steady state, and compared numerically the real optimal solutions with some approximate ones. The lesson learned from the numerical study in this paper is that three approximation methods provide rather different checkpoint sequences with larger error in the earlier operational phase, as the

Tatsuya Ozaki received the B.S.E. and M.S. from Hiroshima University, Japan, in 2001 and 2005, respectively. In 2005, he joined NTT Facilities, Inc., Japan as a Technical Staff member. His research interests are dependable computing and performance evaluation. His papers have appeared in IEEE Transactions on Dependable and Secure Computing and at several major conferences such as DSN 2004 and DASC 2006.

References (43)

A. Duda, The effects of checkpointing on program execution time, Information Processing Letters (1983)
N. Kaio et al., A note on optimum checkpointing policies, Microelectronics and Reliability (1985)
K.F. Wong et al., Checkpointing in distributed systems, Journal of Parallel and Distributed Systems (1996)
J.S. Plank et al., Processor allocation and checkpoint interval selection in cluster computing systems, Journal of Parallel and Distributed Computing (2001)
K.M. Chandy, A survey of analytic models of roll-back and recovery strategies, Computer (1975)
K.M. Chandy et al., Analytic models for rollback and recovery strategies in database systems, IEEE Transactions on Software Engineering (1975)
G.M. Lohman et al., Optimal policy for batch operations: Backup, checkpointing, reorganization and updating, ACM Transactions on Database Systems (1977)
V.F. Nicola, Checkpointing and modeling of program execution time
A.N. Tantawi et al., Performance analysis of checkpointing strategies, ACM Transactions on Computer Systems (1984)
J.W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM (1974)
F. Baccelli, Analysis of a service facility with periodic checkpointing, Acta Informatica (1981)
T. Dohi et al., Availability models with age dependent-checkpointing
E. Gelenbe et al., Performance of rollback recovery systems under intermittent failures, Communications of the ACM (1978)
E. Gelenbe, On the optimum checkpoint interval, Journal of the ACM (1979)
E. Gelenbe et al., Optimum checkpoints with age dependent failures, Acta Informatica (1990)
P.B. Goes et al., Stochastic models for performance analysis of database recovery control, IEEE Transactions on Computers (1995)
V. Grassi et al., On the optimal checkpointing of critical tasks and transaction-oriented systems, IEEE Transactions on Software Engineering (1992)
V.G. Kulkarni et al., Effects of checkpointing and queueing on program performance, Stochastic Models (1990)
V.F. Nicola et al., Comparative analysis of different models of checkpointing and recovery, IEEE Transactions on Software Engineering (1990)
U. Sumita et al., Analysis of effective service time with age dependent interruptions and its application to optimal rollback policy for database management, Queueing Systems (1989)
P. L’Ecuyer et al., Computing optimal checkpointing strategies for rollback and recovery systems, IEEE Transactions on Computers (1988)

Tadashi Dohi received the B.S.E., M.S. and Dr. of Engineering degrees from Hiroshima University, Japan, in 1989, 1991 and 1995, respectively. In 1992, he joined the Department of Industrial and Systems Engineering, Hiroshima University, Japan, as an Assistant Professor. Since 2002, he has been a Full Professor in the Department of Information Engineering, Graduate School of Engineering, Hiroshima University, Japan. In 1992 and 2000, he was a Visiting Research Scholar at the University of British Columbia, Canada and Duke University, USA, respectively, on leave of absence from Hiroshima University. His research areas include software reliability engineering, dependable computing and performance evaluation. He is a Regular Member of ORSJ, JSIAM, IEICE, ISCIE and IEEE. He published over 200 journal papers and refereed conference papers. Dr. Dohi is serving as an Associate Editor of IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences (A) and Asia-Pacific Journal of Operational Research, and as an Editorial Board Member of Journal of Risk and Reliability, Journal of Autonomic and Trusted Computing, International Journal of Reliability and Quality Performance, etc. He published over 300 refereed papers. Dr. Dohi served as a General Chair of several international conferences such as AIWARM 2004–2008 and WoSAR 2008, and as a Program Committee Chair of RASOR 2005–2007 and ISAS 2009.

Naoto Kaio received the B.S.E., M.S. and Dr. of Engineering degrees from Hiroshima University, Japan, in 1976, 1978 and 1982, respectively. He is a Full Professor in the Department of Economic Informatics, Hiroshima Shudo University, Japan. From 1986 to 1987, he was a Visiting Research Scholar in the William E. Simon Graduate School of Business Administration, University of Rochester, USA. His research areas include systems science, operations research and reliability theory. He is a Regular Member of ORSJ, IEICE, JIMA, IPSJ, JSQC, REAJ and IEEE. Also, Dr. Kaio is serving as Regional Editor for Asia in Journal of Quality in Maintenance Engineering.

^☆: This work is supported by a Grant-in-Aid for Scientific Research from the Ministry of Education, Sports, Science and Culture of Japan under Grant No. 18510138 (2006-2008) and the Research Program 2008 under the Center for Academic Development and Cooperation of the Hiroshima Shudo University, Japan. The authors very much appreciate two reviewers’ comments to improve the first version of this paper.
