Optimum checkpoints for programs with loops
Introduction
Cloud and Fog Computing allow diverse software applications to run on complex interconnected systems where reliability and security can be of significant concern. Major failures occur in such systems [1] because of complex interactions among various factors, including human decisions, systemic interactions in the architecture, the software systems and the network connections [2], and malicious adversaries. Furthermore, a recent report [3] states that “The main problems affecting the cloud are insecure interface APIs, shared resources, data breaches, malicious insiders, and misconfiguration issues”, including active adversarial mechanisms [4]. Clearly, Cloud providers will do their best to improve the security and reliability of their platforms. However, we also need methods that can limit the average execution time of applications that run on the Cloud and Fog despite the intermittent failures of the platforms. This is of particular interest for long-running applications and for those that are run frequently and repeatedly.
One such mechanism, which we investigate in this paper, is Application Level Checkpoint and Restart (ALCR). It is widely used to enhance the reliability of long-running programs [5], [6], [7] by periodically saving a copy, or checkpoint, of the current execution state of the software. The most recent copy is then used to restart program execution in case of failure. Originally developed for transaction-oriented systems and databases [8], [9], [10], [11], [12], checkpointing has been widely adopted to improve the reliability of modern High Performance Computing (HPC) [13], [14] software.
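The save-and-restart cycle described above can be sketched as follows. This is a minimal illustration and not the API of any actual ALCR library: the function `run_loop`, its parameters (`k`, `ckpt_path`, `fail_at`) and the use of `pickle` for the saved state are all assumptions made for the example.

```python
import os
import pickle


def run_loop(total_iters, k, ckpt_path, fail_at=None):
    """Sum the integers 0..total_iters-1, checkpointing every k iterations.

    Hypothetical sketch: `fail_at` simulates a platform failure by raising
    an exception just before the given iteration is executed.
    """
    # Restart: resume from the most recent checkpoint if one exists.
    if os.path.exists(ckpt_path):
        with open(ckpt_path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"i": 0, "acc": 0}

    while state["i"] < total_iters:
        if fail_at is not None and state["i"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["i"]  # the loop body (the "useful work")
        state["i"] += 1
        if state["i"] % k == 0:
            # Checkpoint: write to a temporary file, then rename, so that a
            # failure during the write cannot corrupt the previous checkpoint.
            tmp_path = ckpt_path + ".tmp"
            with open(tmp_path, "wb") as f:
                pickle.dump(state, f)
            os.replace(tmp_path, ckpt_path)
    return state["acc"]
```

If a run with `k=10` fails at iteration 25, a second call with the same `ckpt_path` resumes from iteration 20 rather than from 0, so only five iterations are re-executed.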
Long intervals of time between checkpoints will increase the overhead associated with system restart, while short intervals will increase the overhead caused by the checkpoints themselves. The checkpoint interval must therefore be optimized so as to minimize a program’s expected execution time in the presence of failures [15], [16], [17]. The impact of asynchronous checkpointing strategies on the performance of distributed systems has been studied in [18], [19]. Among the existing checkpointing strategies, ALCR [5], [20] uses a small memory footprint [6], [7], but requires significant expertise to select the source code locations at which checkpoints should be inserted. Existing ALCR tools and libraries do facilitate the insertion of checkpoints in long-running loops, since computational loops constitute a significant source of failure-related re-executions [21], [22]. However, such tools do not provide a method for selecting the inter-checkpoint interval, which has a significant influence on the average execution time of software.
In this paper we propose that the inter-checkpoint interval in a specific loop be selected optimally, based on a mathematical model, as a function of the program failure rate, the execution cost of establishing a checkpoint, and the execution time needed to restart the program after a failure. We suggest that this approach can be implemented as an API within an ALCR tool, to select the optimum checkpoint interval in program loops.
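To make the dependence on these three parameters concrete, the sketch below searches for the interval that minimizes the expected run time of a loop. It does not reproduce the model developed in this paper. It assumes a textbook setting in which failures are exponentially distributed with rate `fail_rate`, no failure occurs during a restart, and a checkpoint also closes the final segment; under those assumptions a segment of length T (work plus checkpoint) has expected duration (1/λ + R)(e^{λT} − 1). All function and parameter names are hypothetical. For comparison, it also computes the classical first-order approximation sqrt(2C/λ), converted to loop iterations.

```python
import math


def segment_time(work, ckpt_cost, restart_cost, fail_rate):
    """Expected time to finish `work` plus one checkpoint (assumed model)."""
    return (1.0 / fail_rate + restart_cost) * math.expm1(
        fail_rate * (work + ckpt_cost)
    )


def expected_time(total_iters, k, iter_time, ckpt_cost, restart_cost, fail_rate):
    """Expected run time with a checkpoint after every k iterations."""
    full, rest = divmod(total_iters, k)
    total = full * segment_time(k * iter_time, ckpt_cost, restart_cost, fail_rate)
    if rest:  # shorter final segment, also closed by a checkpoint
        total += segment_time(rest * iter_time, ckpt_cost, restart_cost, fail_rate)
    return total


def best_interval(total_iters, iter_time, ckpt_cost, restart_cost, fail_rate):
    """Brute-force search for the interval K with the lowest expected time."""
    return min(
        range(1, total_iters + 1),
        key=lambda k: expected_time(
            total_iters, k, iter_time, ckpt_cost, restart_cost, fail_rate
        ),
    )


def first_order_interval(iter_time, ckpt_cost, fail_rate):
    """First-order estimate sqrt(2 * C / lambda), expressed in iterations."""
    return max(1, round(math.sqrt(2.0 * ckpt_cost / fail_rate) / iter_time))
```

For instance, `best_interval(10_000, 0.01, 0.5, 1.0, 1e-3)` can be compared with `first_order_interval(0.01, 0.5, 1e-3)`. The brute-force search is linear in the loop size, which is acceptable for a sketch; a closed-form expression, as proposed in this paper, would avoid the search altogether.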
In the sequel, Section 2 reviews earlier work. Section 3 provides examples to help understand the ALCR mechanism and its associated costs. Section 4 describes the mathematical model and the numerical approach. The optimum checkpoint interval is discussed in Section 4.3. Section 5 presents numerical examples and Section 6 presents conclusions and future research.
Section snippets
Related work
If no scheme is adopted to enhance the performance of a transaction oriented system in the presence of failures, all previously executed transactions would need to be re-executed in case of a failure. The Checkpoint and Rollback/Recovery mechanism saves a secure and faithful copy of the system state at predetermined instants (the checkpoints); in addition in the case of transaction oriented systems, it will save an “audit trail” of the sequence of transactions that were executed since the most
Indicative examples
This section provides examples of the changes that must be made to the source code of a software application in order to add checkpoints to long-running loops, using actual ALCR libraries. Their purpose is to help the reader understand the overall concept of the ALCR mechanism, and also to explain why an arbitrary selection of the checkpoint interval may negatively affect the execution time of a software application. These examples are also expected to facilitate
Expected execution time of a program without and with checkpoints
Consider a program P that executes a total of M instructions; since it may contain loops, M counts every instruction that is actually executed. Assume that when the execution starts, there is an overhead associated with loading its data and code into memory, which consumes A time units. If the program is executed without any errors or failures, and if each instruction is executed in c time units, then the total execution time for P will be A + cM. Now suppose that no failures or errors occur
Numerical examples
In this section, the effect of the checkpoint interval K on the expected execution time of a software application is illustrated through a set of numerical examples. More specifically, the case of a software application with a single loop is considered, and the analysis is repeated for different loop sizes (i.e., different values of M). For each one of these cases, the expected execution time of the same program with and without the adoption of the ALCR mechanism is calculated, while the
Conclusions and future work
This paper has proposed a method for setting the checkpoint intervals in ALCR for software applications which contain long-running loops, and which run on platforms that are subject to failures. We have shown that the optimum checkpoint interval, which minimizes the expected execution time of the program, depends on various parameters which can be incorporated into a single numerical expression. The expression can then be used as part of an ALCR tool to compute the optimum checkpoint interval
Acknowledgment
This research was supported by the European Commission through the Horizon 2020 SDK4ED Project under Grant Agreement No. 780572, but the contents of this paper represent solely the opinions of the authors, and it does not engage the responsibility of the European Commission.
References (47)
Dealing with software viruses: a biological paradigm. Inf. Secur. Tech. Rep. (2007)
et al. CPPC: a compiler-assisted tool for portable checkpointing of message-passing applications. Concurr. Comput. (2010)
A higher order estimate of the optimum checkpoint interval for restart dumps. Fut. Gener. Comput. Syst. (2006)
et al. Software engineering techniques. NATO Science Committee (1969)
Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-East-1) Region (2018)
Top 20 high profile cloud failures of all time (2016)
et al. Mistakes in the IaaS cloud could put your data at risk. Symantec (2015)
et al. A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems. J. Supercomput. (2013)
ITALC: interactive tool for application-level checkpointing. Proceedings of the Fourth International Workshop on HPC User Support Tools (2017)
et al. CRAFT: a library for easier application-level checkpoint/restart and automatic fault tolerance. IEEE Trans. Parallel Distrib. Syst. (2018)
A first order approximation to the optimum checkpoint interval. Commun. ACM
A model of roll-back recovery with multiple checkpoints. Proceedings of the 2nd International Conference on Software Engineering
Performance of rollback recovery systems under intermittent failures. Commun. ACM
On the optimum checkpoint interval. J. ACM
Optimum checkpoints with age-dependent failures. Acta Inf.
A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv.
CheCL: transparent checkpointing and process migration of OpenCL applications. Proceedings - 25th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2011
Model on information renewal by the method of multiple test-points (English translation of Avtomat. I Telemekh., 1979:4, pp. 142–151). Autom. Remote Control
Introduction aux Réseaux de Files d’Attente
Analysis and Synthesis of Computer Systems
Availability of a distributed computer system with failures. Acta Inf.
Load sharing in distributed systems with failures. Acta Inf.
Optimum interval for application-level checkpoints. 2019 6th IEEE International Conference on Cyber Security and Cloud Computing (CSCloud)