Numerical computation algorithms for sequential checkpoint placement

https://doi.org/10.1016/j.peva.2008.11.003

Abstract

This paper concerns sequential checkpoint placement problems under two dependability measures: steady-state system availability and expected reward per unit time in the steady state. We develop numerical computation algorithms to determine the optimal checkpoint sequence, based on Brender’s classical fixed-point algorithm, and further give three simple approximation methods. Numerical examples with the Weibull failure time distribution illustrate quantitatively the overestimation and underestimation of the sub-optimal checkpoint sequences obtained from the approximation methods.

Introduction

System failures in large-scale computer systems can lead to huge economic or critical social losses. Checkpointing and rollback recovery is a commonly used technique for improving the dependability of file systems, and is regarded as a low-cost environment diversity technique from the standpoint of fault-tolerant computing. In particular, when a file system that writes and/or reads data is designed with preventive maintenance in mind, checkpoint generation backs up the significant data on the primary medium to a safe secondary medium, occasionally or periodically, and plays a significant role in limiting the amount of data processing needed for recovery after a system failure. If checkpoints are taken frequently, the checkpointing itself incurs a larger overhead. Conversely, if checkpoints are seldom placed, a larger recovery overhead is required after a system failure. Hence, it is important to determine the optimal checkpoint sequence by taking account of the trade-off between these two kinds of overhead. Since the system failure phenomenon under uncertainty is described by a probability distribution, called the system failure time distribution, the optimal checkpoint sequence should be determined based on a stochastic model [1], [2], [3], [4], [5].

Young [6] obtained an approximate optimal checkpoint interval for computation restart after system failures. Baccelli [7], Chandy et al. [2], Dohi et al. [8], Gelenbe and Derochette [9], Gelenbe [10], Gelenbe and Hernandez [11], Goes and Sumita [12], Grassi et al. [13], Kulkarni et al. [14], Nicola and Van Spanje [15], and Sumita et al. [16] proposed performance evaluation models for database recovery, and calculated the optimal checkpoint intervals that maximize the system availability or minimize the mean overhead during normal operation. L’Ecuyer and Malenfant [17] formulated a dynamic checkpoint placement problem as a Markov decision process. Ziv and Bruck [18] reconsidered a checkpoint placement problem under a random environment, taking account of changes in the operating circumstances. Vaidya [19] examined the impact of checkpoint latency on the overhead ratio for a simple checkpoint model. Recently, Okamura et al. [20] reformulated Vaidya’s model [19] as a semi-Markov decision process.

On the other hand, some authors have discussed sequential checkpoint placement problems in which the checkpoint intervals are not always constant. For instance, in almost all checkpoint models for transaction-based systems [7], [8], [9], [10], [11], [12], [16], it could be proved theoretically that constant checkpoint intervals maximizing the system availability were better than independent and identically distributed random checkpoint intervals. In any case, however, the sequential policy with aperiodic checkpoint intervals provides a general framework for checkpoint placement, because sequential checkpointing includes periodic checkpointing as a special case. Duda [21] derived a recursive formula satisfied by the optimal aperiodic checkpoint sequence with respect to the mean program execution time. Toueg and Babaoğlu [22] developed a discrete dynamic programming algorithm that minimizes the expected execution time of tasks by placing checkpoints between two consecutive tasks under very general assumptions. Kaio and Osaki [23] and Ling et al. [24] proposed approximate methods to calculate the optimal checkpoint sequence minimizing the expected cumulative operation cost until the system failure.

In the sequential checkpoint placement problem, it is assumed that the system failure time obeys a general probability distribution, i.e., it does not necessarily obey the negative exponential distribution. Indeed, a non-exponential system failure time distribution with increasing failure rate can be assumed for some real workstation failure data [25]. It is also reported that some system failures are caused by software aging, such as resource exhaustion, and that the system failure time can then no longer be regarded as an exponentially distributed random variable (see e.g. [26]). The sequential checkpoint placement problem is formulated as a complex non-linear optimization problem with an unknown number of decision variables. This makes it computationally difficult to place the aperiodic checkpoint sequence optimally, even for a simple centralized system under a single optimality criterion.

Although the checkpointing models mentioned above mainly focused on centralized systems, the analytical techniques for reliability and performance evaluation can be applied to distributed systems [27]. Wong and Franklin [28] considered simple Markov models to determine the frequency of checkpointing in parallel systems. Plank and Thomason [29] modeled the performance of coordinated checkpointing systems [30], [31], [32] where the number of processors dedicated to the application and the checkpoint intervals are selected by the user before running the program. They employed a birth and death Markov chain to determine the long-term availability of the parallel system. Agbaria et al. [33] took account of rollback propagation [34] and evaluated coordinated checkpointing protocols based on both the overhead ratio and simple Markov chain models. On the other hand, uncoordinated checkpointing techniques [35] are used to reduce the checkpointing overhead in normal processing. Soliman and Elmaghraby [36] developed the so-called hybrid state saving technique to reduce the mean time to execute a finite-length task in an uncoordinated checkpointing system. However, an aperiodic checkpoint scheme has not yet been developed in the literature on parallel and distributed systems.

In this paper we consider sequential checkpoint placement problems for centralized systems under two dependability measures: steady-state system availability and expected reward per unit time in the steady state [37], [38]. When the checkpoint strategy is restricted to constant intervals, the past literature [7], [2], [8], [9], [10], [11], [12], [16], [6] provided satisfactory answers on the optimal constant checkpoint interval with the negative exponential or the general system failure time distribution under specific cost criteria. Surprisingly, the general sequential checkpoint placement problems have not been studied sufficiently during the last three decades, except for a few examples [21], [23], [24], [22]. Recently, Ozaki et al. [39] dealt with the same problem under the expected cumulative operation cost over an infinite/finite time horizon, and developed an effective computation algorithm to calculate the optimal checkpoint sequence. However, the algorithm proposed in [39] was not general-purpose and could not be applied to the general problems. In this paper, we develop numerical computation algorithms for the optimal sequential checkpoint placement under the steady-state system availability and the expected reward per unit time in the steady state. The basic idea is due to Brender’s classical fixed-point theorem [40], which ensures that the computation algorithms proposed here eventually converge to the true optimal solutions.

The rest of this paper is organized as follows. In Section 2, we define the notation and describe two sequential checkpoint placement models with perfect and imperfect checkpointing, referred to as Model A and Model B, respectively. The system availability and the expected reward rate are formulated in Section 3 and Section 4, respectively, where the numerical computation algorithms to maximize them are derived. In Section 5 we introduce three approximation methods to calculate the sub-optimal checkpoint sequence. Numerical examples with the Weibull failure time distribution are given in Section 6 to illustrate the overestimation and underestimation of the sub-optimal checkpoint sequences obtained from the approximation methods. We compare the true optimal checkpoint sequence and its associated dependability measures with the three approximate solutions. Finally, the paper is concluded with some remarks in Section 7. The computation algorithms in this paper provide exact solutions to problems that have remained unsolved over the last three decades, and their impact on practical fault-tolerant file management is expected to be significant, because the underlying techniques may be applied to aperiodic and distributed checkpointing protocols.

Section snippets

Model A

Consider a simple file system with sequential checkpointing over an infinite time horizon. The system operation starts at time $t_0 = 0$, and the checkpoint (CP) is sequentially placed at times $\{t_1, t_2, \ldots, t_k, \ldots\}$. At each CP, $t_k$ $(k = 1, 2, \ldots)$, all the file data in the main memory is saved to a safe secondary medium such as CD-ROM, where the cost (time overhead) $c_0$ $(> 0)$ is needed per CP placement. It is assumed at the moment that the system operation stops during the checkpointing and the file system has

Model A

From the previous discussion in Section 2, the steady-state system availability for Model A is given by

$$A_A(\bar{t}) = \frac{E[X_{\bar{t}}]}{T(\bar{t})} = \left\{ c_0 \left[ 1 + \sum_{k=1}^{\infty} \bar{F}(t_k) \right] + \mu^{-1} + a_0 \left[ \sum_{k=1}^{\infty} \int_{t_{k-1}}^{t_k} x \, dF(x) - \sum_{k=1}^{\infty} (t_k - t_{k-1}) \bar{F}(t_k) \right] + b_0 \right\}^{-1} \Big/ \mu.$$

It should be noted that obtaining the optimal CP sequence maximizing $A_A(\bar{t})$ is equivalent to minimizing $T(\bar{t})$ because the numerator of Eq. (18) is constant with respect to $\bar{t}$. Recently, this type of optimal CP placement problem was considered by the same authors [39], where the sequence of
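
As a rough illustration of how the availability expression above can be evaluated for a given CP sequence, the following Python sketch computes $T(\bar{t})$ and $A_A(\bar{t})$ for a finite, increasing list of CP times. It assumes that the recovery term reads $a_0[1/\mu - \sum_k (t_k - t_{k-1})\bar{F}(t_k)] + b_0$, using the fact that the integrals of $x\,dF(x)$ over all intervals add up to the mean failure time $1/\mu$. The function names and the parameter values are hypothetical and are not taken from the paper.

```python
import math

def cycle_length(cps, sf, mean, c0, a0, b0):
    """Expected cycle length T for CP times t_1 < t_2 < ... (with t_0 = 0).

    sf   -- survival function of the failure time, sf(t) = 1 - F(t)
    mean -- mean failure time 1/mu
    """
    # Expected checkpointing overhead: c0 * [1 + sum_k sf(t_k)]
    cp_overhead = c0 * (1.0 + sum(sf(t) for t in cps))
    # sum_k (t_k - t_{k-1}) * sf(t_k)
    saved = 0.0
    prev = 0.0
    for t in cps:
        saved += (t - prev) * sf(t)
        prev = t
    # Expected recovery time: a0 * E[work lost since the last CP] + b0
    recovery = a0 * (mean - saved) + b0
    return cp_overhead + mean + recovery

def availability(cps, sf, mean, c0, a0, b0):
    """Steady-state availability: mean up time divided by cycle length."""
    return mean / cycle_length(cps, sf, mean, c0, a0, b0)

# Illustrative values only: exponential failures, periodic CPs every 15 units.
mu, c0, a0, b0 = 0.01, 1.0, 1.0, 0.0
sf = lambda t: math.exp(-mu * t)
cps = [15.0 * k for k in range(1, 2001)]
A = availability(cps, sf, 1.0 / mu, c0, a0, b0)
print(A)
```

Passing a non-periodic list for `cps` evaluates an aperiodic sequence with the same code, which is the case of interest when the failure time is not exponential.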

Performability analysis

Our next concern is the expected reward per unit time in the steady state. Define:

  • a: reward per unit time in the normal state

  • b: reward per unit time while a CP is being placed

  • c: reward per unit recovery time in the system down state

  • $P_1(\bar{t})$: steady-state probability that the system is in normal state

  • $P_2(\bar{t})$: steady-state probability that the system is in checkpointing state

  • $P_3(\bar{t})$: steady-state probability that the system is recovering from a system failure.

Then, the expected reward per unit time in
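
One natural way to combine the quantities defined above is the weighted sum $a P_1 + b P_2 + c P_3$. The Python sketch below illustrates that arithmetic under the assumption that each steady-state probability equals the expected time spent in that state per cycle divided by the expected cycle length. The function name, its arguments and the numbers are hypothetical.

```python
def reward_rate(up_time, cp_time, recovery_time, a, b, c):
    """Expected reward per unit time from expected per-cycle times in each state."""
    cycle = up_time + cp_time + recovery_time
    p1 = up_time / cycle        # normal state
    p2 = cp_time / cycle        # checkpointing state
    p3 = recovery_time / cycle  # recovery state
    return a * p1 + b * p2 + c * p3, (p1, p2, p3)

# Illustrative values only: reward 1 while running, 0 while checkpointing,
# and a penalty of 0.5 per unit time while recovering.
rate, probs = reward_rate(100.0, 7.0, 8.0, a=1.0, b=0.0, c=-0.5)
print(rate, probs)
```

With a = 1 and b = c = 0 the reward rate reduces to the fraction of time spent in the normal state, i.e. the availability.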

Exponential approximation

For Model A under the availability criterion, it is well known that the optimal CP interval is constant, i.e., $t_1 = t_2 - t_1 = \cdots = t_{k+1} - t_k = \cdots$, if $F(t)$ is the exponential distribution with mean $1/\mu$. Under the assumptions that $a_0 = 1$ and $b_0 = 0$, Young [6] considered the checkpoint restart model with constant CP interval with the exponential system failure time distribution, and derived the following non-linear equation which the optimal CP interval $t_1$ satisfies: $e^{c_0 \mu} - t_1 \mu - e^{-t_1 \mu} = 0$. Based on the second order
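
To illustrate how a non-linear equation of this kind can be solved numerically, the sketch below applies bisection to $g(t) = e^{c_0 \mu} - \mu t - e^{-\mu t}$ and compares the root with Young's well-known first-order value $\sqrt{2 c_0 / \mu}$. The form of $g$ is an assumed reading of the equation above (its signs are reconstructed and not confirmed by the source), and the parameter values and function names are hypothetical.

```python
import math

def young_equation_root(c0, mu, tol=1e-12):
    """Bisection root of g(t) = exp(c0*mu) - mu*t - exp(-mu*t) for t > 0.

    The form of g is an assumption made for this illustration.
    """
    g = lambda t: math.exp(c0 * mu) - mu * t - math.exp(-mu * t)
    # g(0) = exp(c0*mu) - 1 > 0, g is decreasing, and g(hi) < 0 at this bound.
    lo, hi = 0.0, math.exp(c0 * mu) / mu
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def young_first_order(c0, mu):
    """Young's first-order approximation sqrt(2 * c0 / mu)."""
    return math.sqrt(2.0 * c0 / mu)

# Illustrative values only.
c0, mu = 1.0, 0.01
t_root = young_equation_root(c0, mu)
t_approx = young_first_order(c0, mu)
print(t_root, t_approx)
```

Expanding $e^{-\mu t}$ to second order in this $g$ gives $(\mu t)^2 / 2 \approx e^{c_0 \mu} - 1 \approx c_0 \mu$, which is how the square-root formula arises from the assumed equation.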

Numerical examples

In this section we calculate numerically the exact and approximate optimal CP sequences, and compare them in terms of dependability measures. Here we refer to the approximations based on an exponential distribution, a constant CP interval with a general distribution, and a constant hazard as Approximation 1, Approximation 2 and Approximation 3, respectively. Suppose that the system failure time obeys the Weibull distribution $F(t) = 1 - e^{-(t/\eta)^m}$ with shape parameter $m$ $(> 0)$ and scale parameter $\eta$ $(> 0)$. We
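
For reference, the Weibull quantities that such experiments rely on can be sketched in a few lines of Python. The functions below evaluate $F(t) = 1 - e^{-(t/\eta)^m}$, the survival function, the hazard rate $(m/\eta)(t/\eta)^{m-1}$ and the mean $\eta\,\Gamma(1 + 1/m)$. The function names and parameter values are hypothetical and are not taken from the paper.

```python
import math

def weibull_cdf(t, m, eta):
    """F(t) = 1 - exp(-(t/eta)^m)."""
    return 1.0 - math.exp(-((t / eta) ** m))

def weibull_sf(t, m, eta):
    """Survival function 1 - F(t)."""
    return math.exp(-((t / eta) ** m))

def weibull_hazard(t, m, eta):
    """Hazard rate f(t) / (1 - F(t)) = (m/eta) * (t/eta)^(m-1)."""
    return (m / eta) * (t / eta) ** (m - 1.0)

def weibull_mean(m, eta):
    """Mean failure time eta * Gamma(1 + 1/m)."""
    return eta * math.gamma(1.0 + 1.0 / m)

# Illustrative values only: shape m = 2 gives an increasing failure rate.
m, eta = 2.0, 100.0
hazards = [weibull_hazard(t, m, eta) for t in (10.0, 50.0, 100.0)]
print(hazards, weibull_mean(m, eta))
```

For m = 1 the hazard is constant and the distribution reduces to the exponential case with mean η, while m > 1 gives the increasing failure rate for which aperiodic CP sequences become relevant.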

Conclusion

In this paper, we have developed numerical computation algorithms for sequential checkpoint placement, so as to maximize the steady-state system availability and the expected reward per unit time in the steady state, and compared numerically the true optimal solutions with some approximate ones. The lesson learned from the numerical study in this paper is that the three approximation methods provide rather different checkpoint sequences with larger error in the earlier operational phase, as the

Tatsuya Ozaki received the B.S.E. and M.S. from Hiroshima University, Japan, in 2001 and 2005, respectively. In 2005, he joined NTT Facilities, Inc., Japan as a Technical Staff member. His research interests are dependable computing and performance evaluation. His papers have appeared in IEEE Transactions on Dependable and Secure Computing and at several major conferences such as DSN 2004 and DASC 2006.

References (43)

  • A. Duda

    The effects of checkpointing on program execution time

    Information Processing Letters

    (1983)
  • N. Kaio et al.

    A note on optimum checkpointing policies

    Microelectronics and Reliability

    (1985)
  • K.F. Wong et al.

    Checkpointing in distributed systems

    Journal of Parallel and Distributed Computing

    (1996)
  • J.S. Plank et al.

    Processor allocation and checkpoint interval selection in cluster computing systems

    Journal of Parallel and Distributed Computing

    (2001)
  • K.M. Chandy

    A survey of analytic models of roll-back and recovery strategies

    Computer

    (1975)
  • K.M. Chandy et al.

    Analytic models for rollback and recovery strategies in database systems

    IEEE Transactions on Software Engineering

    (1975)
  • G.M. Lohman et al.

    Optimal policy for batch operations: Backup, checkpointing, reorganization and updating

    ACM Transactions on Database Systems

    (1977)
  • V.F. Nicola

    Checkpointing and modeling of program execution time

  • A.N. Tantawi et al.

    Performance analysis of checkpointing strategies

    ACM Transactions on Computer Systems

    (1984)
  • J.W. Young

    A first order approximation to the optimum checkpoint interval

    Communications of the ACM

    (1974)
  • F. Baccelli

    Analysis of a service facility with periodic checkpointing

    Acta Informatica

    (1981)
  • T. Dohi et al.

    Availability models with age-dependent checkpointing

  • E. Gelenbe et al.

    Performance of rollback recovery systems under intermittent failures

    Communications of the ACM

    (1978)
  • E. Gelenbe

    On the optimum checkpoint interval

    Journal of the ACM

    (1979)
  • E. Gelenbe et al.

    Optimum checkpoints with age dependent failures

    Acta Informatica

    (1990)
  • P.B. Goes et al.

    Stochastic models for performance analysis of database recovery control

    IEEE Transactions on Computers

    (1995)
  • V. Grassi et al.

    On the optimal checkpointing of critical tasks and transaction-oriented systems

    IEEE Transactions on Software Engineering

    (1992)
  • V.G. Kulkarni et al.

    Effects of checkpointing and queueing on program performance

    Stochastic Models

    (1990)
  • V.F. Nicola et al.

    Comparative analysis of different models of checkpointing and recovery

    IEEE Transactions on Software Engineering

    (1990)
  • U. Sumita et al.

    Analysis of effective service time with age dependent interruptions and its application to optimal rollback policy for database management

    Queueing Systems

    (1989)
  • P. L’Ecuyer et al.

    Computing optimal checkpointing strategies for rollback and recovery systems

    IEEE Transactions on Computers

    (1988)

    Tadashi Dohi received the B.S.E., M.S. and Dr. of Engineering degrees from Hiroshima University, Japan, in 1989, 1991 and 1995, respectively. In 1992, he joined the Department of Industrial and Systems Engineering, Hiroshima University, Japan, as an Assistant Professor. Since 2002, he has been a Full Professor in the Department of Information Engineering, Graduate School of Engineering, Hiroshima University, Japan. In 1992 and 2000, he was a Visiting Research Scholar at the University of British Columbia, Canada and Duke University, USA, respectively, on leave of absence from Hiroshima University. His research areas include software reliability engineering, dependable computing and performance evaluation. He is a Regular Member of ORSJ, JSIAM, IEICE, ISCIE and IEEE. He published over 200 journal papers and refereed conference papers. Dr. Dohi is serving as an Associate Editor of IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences (A) and Asia-Pacific Journal of Operational Research, and as an Editorial Board Member of Journal of Risk and Reliability, Journal of Autonomic and Trusted Computing, International Journal of Reliability and Quality Performance, etc. He published over 300 refereed papers. Dr. Dohi served as a General Chair of several international conferences such as AIWARM 2004–2008 and WoSAR 2008, and as a Program Committee Chair of RASOR 2005–2007 and ISAS 2009.

    Naoto Kaio received the B.S.E., M.S. and Dr. of Engineering degrees from Hiroshima University, Japan, in 1976, 1978 and 1982, respectively. He is a Full Professor in the Department of Economic Informatics, Hiroshima Shudo University, Japan. From 1986 to 1987, he was a Visiting Research Scholar in the William E. Simon Graduate School of Business Administration, University of Rochester, USA. His research areas include systems science, operations research and reliability theory. He is a Regular Member of ORSJ, IEICE, JIMA, IPSJ, JSQC, REAJ and IEEE. Also, Dr. Kaio is serving as Regional Editor for Asia in Journal of Quality in Maintenance Engineering.

    This work is supported by a Grant-in-Aid for Scientific Research from the Ministry of Education, Sports, Science and Culture of Japan under Grant No. 18510138 (2006-2008) and the Research Program 2008 under the Center for Academic Development and Cooperation of the Hiroshima Shudo University, Japan. The authors very much appreciate two reviewers’ comments to improve the first version of this paper.
