Optimum checkpoints for programs with loops

doi:10.1016/j.simpat.2019.101951

Simulation Modelling Practice and Theory

Volume 97, December 2019, 101951

https://doi.org/10.1016/j.simpat.2019.101951

Abstract

Checkpoints are widely used to improve the performance of computer systems and programs in the presence of failures, and they significantly reduce the overall cost of running a program when the program or the underlying system is subject to failures. Application level checkpointing has therefore been proposed for programs that execute on failure-prone platforms, and also to reduce the execution time of programs that are prone to internal failures. This paper develops a mathematical model to estimate the average execution time of a program in the presence of failures, with and without application level checkpointing, and uses it to predict the optimum number of instructions that should be executed between successive checkpoints. The case of programs with loops and nested loops is also discussed. The results are illustrated with several numerical examples.

Introduction

Cloud and Fog Computing allow diverse software applications to run on complex interconnected systems where reliability and security can be of significant concern. Major failures do occur in such systems [1], owing to complex interactions among factors that include human decisions, systemic interactions in the architecture and the software systems, the network connections [2], and malicious adversaries. Furthermore, a recent report [3] states that “The main problems affecting the cloud are insecure interface APIs, shared resources, data breaches, malicious insiders, and misconfiguration issues”, to which active adversarial mechanisms must be added [4]. Clearly, Cloud providers will do their best to improve the security and reliability of their platforms. However, we also need methods that can limit the average execution time of applications that run on the Cloud and Fog despite the intermittent failures of the platforms. This is of particular interest for long-running applications and for those that are run frequently and repeatedly.

One such mechanism, which we investigate in this paper, is Application Level Checkpoint and Restart (ALCR). It is widely used to enhance the reliability of long-running programs [5], [6], [7] by periodically saving a copy, or checkpoint, of the current execution state of the software. The most recent copy is then used to restart program execution in case of failure. Originally developed for transaction-oriented systems and databases [8], [9], [10], [11], [12], checkpointing has been widely adopted to improve the reliability of modern High Performance Computing (HPC) [13], [14] software.
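
The save-and-restart cycle described above can be sketched in a few lines. This is a minimal illustration only, not the API of any ALCR library cited in this paper: the function names (`save_checkpoint`, `load_checkpoint`, `run_loop`), the pickle-based state file, and the `fail_at` parameter used to simulate a failure are all assumptions made for the example.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it over the old copy,
    so that a crash during the write cannot corrupt the last good checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        pickle.dump(state, tmp)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the most recent saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run_loop(total_iterations, checkpoint_every, path, fail_at=None):
    """Sum the integers 0..total_iterations-1, checkpointing every
    `checkpoint_every` iterations. `fail_at` (hypothetical) injects a failure."""
    # Restart from the latest checkpoint if one exists, else from scratch.
    state = load_checkpoint(path) or {"i": 0, "acc": 0}
    while state["i"] < total_iterations:
        if fail_at is not None and state["i"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["i"]
        state["i"] += 1
        if state["i"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

If a run fails part-way through, calling `run_loop` again with the same `path` resumes from the last saved iteration instead of from zero; only the work done since that checkpoint is repeated.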

Long intervals of time between checkpoints will increase the overhead associated with system restart, because more work has to be re-executed after each failure, while short intervals will increase the overhead caused by the checkpoints themselves. The checkpoint interval must then be optimized so as to minimize a program’s expected execution time in the presence of failures [15], [16], [17]. In [18], [19] the impact of asynchronous checkpointing strategies on the performance of distributed systems has been studied. Among the existing checkpointing strategies, ALCR [5], [20] uses a small memory footprint [6], [7], but requires significant expertise for the selection of the source code locations in which checkpoints should be inserted. Existing ALCR tools and libraries do facilitate the insertion of checkpoints in long-running loops, since computational loops constitute a significant source of failure-related re-executions [21], [22]. However, such tools do not provide a method to select the inter-checkpoint interval, which has a significant influence on the average execution time of software.
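
To give a feel for the trade-off described above, the sketch below evaluates the classical first-order approximation usually attributed to Young (Commun. ACM, 1974), in which the interval between checkpoints is approximately sqrt(2 C / λ), where C is the time needed to write one checkpoint and λ is the failure rate. This is a well-known earlier result and is not the model developed in this paper. The function names, the conversion from time to a number of instructions, and the example values are assumptions made for illustration.

```python
import math


def young_interval(checkpoint_cost, failure_rate):
    """First-order approximation of the checkpoint interval, in time units:
    sqrt(2 * C / lambda), with C the checkpoint cost and lambda the failure rate."""
    if checkpoint_cost <= 0 or failure_rate <= 0:
        raise ValueError("checkpoint_cost and failure_rate must be positive")
    return math.sqrt(2.0 * checkpoint_cost / failure_rate)


def interval_in_instructions(checkpoint_cost, failure_rate, time_per_instruction):
    """Convert the time interval into an instruction count (assumes every
    instruction takes the same time, which is a simplification)."""
    tau = young_interval(checkpoint_cost, failure_rate)
    return max(1, round(tau / time_per_instruction))


# Hypothetical values: a checkpoint takes 2 s and failures arrive on average
# once every 10,000 s.
tau = young_interval(checkpoint_cost=2.0, failure_rate=1e-4)
```

The approximation captures both effects at once: a more expensive checkpoint (larger C) pushes the interval up, and a higher failure rate (larger λ) pushes it down.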

In this paper we propose that the inter-checkpoint interval in a specific loop be selected optimally, based on a mathematical model, as a function of the program failure rate, the execution cost of establishing a checkpoint, and the execution time required to restart the program after a failure. We suggest that this approach can be implemented as an API within an ALCR tool, to select the optimum checkpoint interval in program loops.

In the sequel, Section 2 reviews earlier work. Section 3 provides examples to help understand the ALCR mechanism and its associated costs. Section 4 describes the mathematical model and the numerical approach. The optimum checkpoint interval is discussed in Section 4.3. Section 5 presents numerical examples and Section 6 presents conclusions and future research.

Section snippets

Related work

If no scheme is adopted to enhance the performance of a transaction oriented system in the presence of failures, all previously executed transactions would need to be re-executed in case of a failure. The Checkpoint and Rollback/Recovery mechanism saves a secure and faithful copy of the system state at predetermined instants (the checkpoints); in addition in the case of transaction oriented systems, it will save an “audit trail” of the sequence of transactions that were executed since the most

Indicative examples

In this section, some examples are provided of the changes that must be made to the source code of a software application in order to add checkpoints to long-running loops, using actual ALCR libraries. Their purpose is to help the reader understand the overall concept of the ALCR mechanism, and also to explain why an arbitrary selection of the checkpoint interval may negatively affect the execution time of a software application. These examples are also expected to facilitate

Expected execution time of a program without and with checkpoints

Consider a program P that executes a total of M instructions; since P may contain loops, M counts every instruction that is actually executed, not the number of instructions in the source code. Assume that when the execution starts, there is an overhead associated with loading its data and code into memory, which consumes A time units. If the program is executed without any errors or failures, and if each instruction is executed in c time units, then the total execution time for P will be: $T(P) = A + cM.$ Now suppose that no failures or errors occur
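
As a small worked example of the failure-free execution time T(P) = A + cM, the sketch below counts M for a nested loop and evaluates the formula. The helper names and the numerical values are hypothetical, and the count deliberately ignores loop-control instructions, which is a simplification.

```python
def total_instructions(body_instructions, *trip_counts):
    """Number of executed instructions M for a loop nest: the body size
    multiplied by the trip count of every enclosing loop."""
    m = body_instructions
    for n in trip_counts:
        m *= n
    return m


def failure_free_time(A, c, M):
    """T(P) = A + c * M: start-up overhead plus c time units per instruction."""
    return A + c * M


# Hypothetical program: a 50-instruction body inside a 1000 x 200 loop nest,
# with a start-up overhead of 100 time units and 2 time units per instruction.
M = total_instructions(50, 1000, 200)
T = failure_free_time(A=100, c=2, M=M)
```

Here M = 50 × 1000 × 200 = 10,000,000 executed instructions, so T(P) = 100 + 2 × 10,000,000 = 20,000,100 time units.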

Numerical examples

In this section, the effect of the checkpoint interval K on the expected execution time of a software application is illustrated through a set of numerical examples. More specifically, the case of a software application with a single loop is considered, and the analysis is repeated for different loop sizes (i.e., different values of M). For each one of these cases, the expected execution time of the same program with and without the adoption of the ALCR mechanism is calculated, while the
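
The kind of sweep over K described in this section can be mimicked with a deliberately simplified toy model. It is not the model of this paper, whose equations are not reproduced in this excerpt. The assumptions are: each instruction fails independently with probability p; a failure costs R time units and forces re-execution from the last checkpoint; writing a checkpoint costs B time units and never fails; a checkpoint is taken after every K instructions. Under these assumptions, the expected number of failed attempts per segment of K instructions is (1 - p)^(-K) - 1, and the expected number of instruction executions needed to complete the segment is that quantity divided by p. All names and parameter values are hypothetical.

```python
import math


def expected_retries(K, p):
    """Expected number of failed attempts before K instructions complete in a
    row: (1 - p)**(-K) - 1, computed stably for small p."""
    return math.expm1(-K * math.log1p(-p))


def expected_time(M, K, A, B, R, c, p):
    """Expected run time with a checkpoint after every K instructions."""
    # Each failed attempt contributes, on average, c / p of wasted-or-useful
    # execution time per retry unit, plus the restart cost R.
    segment = (c / p + R) * expected_retries(K, p) + B
    return A + (M / K) * segment


def expected_time_without_checkpoints(M, A, R, c, p):
    """Expected run time when every failure restarts the whole program."""
    return A + (c / p + R) * expected_retries(M, p)


def best_interval(M, A, B, R, c, p):
    """Brute-force search for the K in 1..M with the smallest expected time."""
    return min(range(1, M + 1),
               key=lambda K: expected_time(M, K, A, B, R, c, p))


# Hypothetical single-loop program with M = 10,000 executed instructions.
params = dict(A=0.0, B=20.0, R=5.0, c=1.0, p=1e-3)
K_best = best_interval(10_000, **params)
```

Evaluating `expected_time` over a range of K values for these parameters gives a curve that is high for very small K (checkpoint overhead dominates), high for very large K (re-execution dominates), and lowest somewhere in between, which is the qualitative behaviour the numerical examples are meant to illustrate.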

Conclusions and future work

This paper has proposed a method for setting the checkpoint intervals in ALCR for software applications which contain long-running loops, and which run on platforms that are subject to failures. We have shown that the optimum checkpoint interval, which minimizes the expected execution time of the program, depends on various parameters which can be incorporated into a single numerical expression. The expression can then be used as part of an ALCR tool to compute the optimum checkpoint interval

Acknowledgment

This research was supported by the European Commission through the Horizon 2020 SDK4ED Project under Grant Agreement No. 780572. The contents of this paper represent solely the opinions of the authors and do not engage the responsibility of the European Commission.

References (47)

E. Gelenbe. Dealing with software viruses: a biological paradigm. Inf. Secur. Tech. Rep. (2007)
G. Rodríguez et al. CPPC: a compiler-assisted tool for portable checkpointing of message-passing applications. Concurr. Comput. (2010)
J. Daly. A higher order estimate of the optimum checkpoint interval for restart dumps. Fut. Gener. Comput. Syst. (2006)
E. Dijkstra et al. Software engineering techniques. NATO Science Committee (1969)
Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-East-1) Region (2018)
Top 20 high profile cloud failures of all time (2016)
C. Wueest et al. Mistakes in the IaaS cloud could put your data at risk. Symantec (2015)
I.P. Egwutuoha et al. A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems. J. Supercomput. (2013)
R. Arora. ITALC: interactive tool for application-level checkpointing. Proceedings of the Fourth International Workshop on HPC User Support Tools (2017)
F. Shahzad et al. CRAFT: a library for easier application-level checkpoint/restart and automatic fault tolerance. IEEE Trans. Parallel Distrib. Syst. (2018)
J.W. Young. A first order approximation to the optimum checkpoint interval. Commun. ACM (1974)
E. Gelenbe. A model of roll-back recovery with multiple checkpoints. Proceedings of the 2nd International Conference on Software Engineering (1976)
E. Gelenbe et al. Performance of rollback recovery systems under intermittent failures. Commun. ACM (1978)
E. Gelenbe. On the optimum checkpoint interval. J. ACM (1979)
E. Gelenbe et al. Optimum checkpoints with age-dependent failures. Acta Inf. (1990)
E.N. Elnozahy et al. A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv. (2002)
H. Takizawa et al. CheCL: transparent checkpointing and process migration of OpenCL applications. Proceedings - 25th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2011 (2011)
E. Gelenbe. Model on information renewal by the method of multiple test-points (English translation of Avtomat. I Telemekh., 1979:4, pp. 142–151). Autom. Remote Control (1979)
E. Gelenbe et al. Introduction aux Réseaux de Files d’Attente (1982)
E. Gelenbe et al. Analysis and Synthesis of Computer Systems (2010)
E. Gelenbe et al. Availability of a distributed computer system with failures. Acta Inf. (1986)
S.K. Tripathi et al. Load sharing in distributed systems with failures. Acta Inf. (1988)
M. Siavvas et al. Optimum interval for application-level checkpoints. 2019 6th IEEE International Conference on Cyber Security and Cloud Computing (CSCloud) (2019)

