Abstract
Users running very long computations risk losing work to machine failures. Such losses can often be reduced by scheduling saves of successfully completed work to secure storage devices. In the model studied here, the user leaves the computation unattended for extended periods of time, after which he or she returns to check whether a machine failure has occurred. When a check reveals a failure, the user resets the computation so that it resumes from the point of the last successful save.
Saves are themselves time-consuming, so any strategy for scheduling saves must strike a balance between the computing time spent on saves and the computing time that is occasionally lost when a failure occurs after the last successful save.
For a given time to the next check and given constant save times, this paper computes schedules that maximize the expected amount of work successfully done before the next check, under the uniform and exponential failure laws. Explicit formulas are obtained for the uniform law. A recurrence leads to routine numerical calculations for the more difficult system with an exponential failure law.
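The trade-off described above can be illustrated with a minimal sketch. It is not the paper's method: it assumes that a stretch of computing counts as saved only if no failure occurs before its save completes, that nothing further is done after a failure until the check, and it searches only over schedules of equal-length stretches that fill the time to the next check. The function names, the survival functions, and the numbers `T`, `s`, and `rate` are all hypothetical choices for illustration.

```python
import math

def expected_saved_work(intervals, save_time, survival):
    """Expected work secured by saves for one schedule (illustrative model).

    intervals: lengths of the computing stretches, each followed by one save.
    survival(t): probability that no failure has occurred by time t.
    """
    elapsed = 0.0
    total = 0.0
    for x in intervals:
        elapsed += x + save_time        # time at which this save completes
        total += x * survival(elapsed)  # stretch counts only if its save completes
    return total

def equal_schedule(horizon, save_time, n):
    """n equal computing stretches that, with their saves, fill the horizon."""
    x = horizon / n - save_time
    return [x] * n if x > 0 else []

def best_equal_schedule(horizon, save_time, survival):
    """Try every feasible number of equal stretches and keep the best one."""
    best_n, best_value = 0, 0.0
    n = 1
    while horizon / n > save_time:
        schedule = equal_schedule(horizon, save_time, n)
        value = expected_saved_work(schedule, save_time, survival)
        if value > best_value:
            best_n, best_value = n, value
        n += 1
    return best_n, best_value

# Hypothetical parameters: time to next check, save time, failure rate.
T, s, rate = 10.0, 0.5, 0.1

def exponential_survival(t):
    return math.exp(-rate * t)

def uniform_survival(t):
    # Assumes the failure time is uniform on [0, T].
    return max(0.0, 1.0 - t / T)

if __name__ == "__main__":
    print(best_equal_schedule(T, s, exponential_survival))
    print(best_equal_schedule(T, s, uniform_survival))
```

Restricting the search to equal stretches keeps the sketch short. The schedules computed in the paper are not assumed to have this form.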
Cite this article
Coffman, E.G., Flatto, L. & Kreinin, A.Y. Scheduling saves in fault-tolerant computations. Acta Informatica 30, 409–423 (1993). https://doi.org/10.1007/BF01210593
DOI: https://doi.org/10.1007/BF01210593