Summary
An approximate model is presented for the mean response time of a distributed computer system in which components may fail. Each node periodically takes a checkpoint and also periodically tests the other nodes to determine whether they have failed. When a node fails, it distributes its workload to the nodes that appear to be operational according to the results of its most recent test. The model explicitly allows for the delays incurred when transactions are incorrectly transferred to failed nodes because the testing results are out of date. For the case in which all nodes are identical, a closed-form solution is derived for the testing rate that minimizes the average response time. Numerical results illustrate the relationships among the problem parameters.
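The transfer mechanism described in the summary can be illustrated with a small sketch. This is only a toy illustration, not the paper's response time model: the function names, the round-robin split, and the node and transaction labels are all assumptions made for the example. It shows how a workload handed out according to an out-of-date test result can end up on a node that has since failed.

```python
from typing import Dict, List


def redistribute(transactions: List[str], last_test: Dict[str, bool]) -> Dict[str, List[str]]:
    """Split transactions round-robin among nodes that appeared operational
    at the most recent test (hypothetical policy, chosen for illustration)."""
    targets = sorted(node for node, looked_up in last_test.items() if looked_up)
    if not targets:
        raise ValueError("no node appeared operational in the last test")
    plan: Dict[str, List[str]] = {node: [] for node in targets}
    for i, txn in enumerate(transactions):
        plan[targets[i % len(targets)]].append(txn)
    return plan


def misdirected(plan: Dict[str, List[str]], currently_up: Dict[str, bool]) -> List[str]:
    """Return the transactions that were sent to nodes which have actually failed."""
    return [
        txn
        for node, txns in plan.items()
        if not currently_up.get(node, False)
        for txn in txns
    ]


# Result of the failing node's most recent test: B and C looked operational.
last_test = {"B": True, "C": True, "D": False}
# True state at the moment of transfer: C has failed since it was tested.
currently_up = {"B": True, "C": False, "D": False}

plan = redistribute(["t1", "t2", "t3", "t4"], last_test)
lost = misdirected(plan, currently_up)
print(plan)  # {'B': ['t1', 't3'], 'C': ['t2', 't4']}
print(lost)  # ['t2', 't4']
```

In this sketch the transactions sent to node C are the ones that would suffer the extra delay. Testing more often makes the snapshot less likely to be stale but costs more testing effort, which is the trade-off behind the optimal testing rate mentioned in the summary.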
Additional information
This research was performed while Satish Tripathi and David Finkel were visiting ISEM. Satish Tripathi's research was supported in part by grants from NSF (grant no. DCR-84-05235) and NASA (grant no. NAG 5-235), and by Université de Paris-Sud.
About this article
Cite this article
Tripathi, S.K., Finkel, D. & Gelenbe, E. Load sharing in distributed systems with failures. Acta Informatica 25, 677–689 (1988). https://doi.org/10.1007/BF00291054
DOI: https://doi.org/10.1007/BF00291054