
Load sharing in distributed systems with failures

Acta Informatica

Summary

An approximate model is presented for the mean response time in a distributed computer system in which components may fail. Each node in the system periodically performs a checkpoint and also periodically tests the other nodes to determine whether they have failed. When a node fails, it distributes its workload to the other nodes that appear to be operational according to its most recent test. The response time model explicitly allows for the delays caused by transactions being incorrectly transferred to failed nodes because of out-of-date test results. For the case in which all nodes are identical, a closed-form solution is derived for the testing rate that minimizes the average response time. Numerical results are presented illustrating the relationships among the problem parameters.
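The trade-off behind an optimal testing rate can be illustrated with a deliberately simplified sketch. This is not the model developed in the article; it is a toy under assumptions introduced here only for illustration: node lifetimes are exponential with rate `failure_rate`, tests of a node arrive as a Poisson process with rate `test_rate`, there is no repair, and the cost is a hypothetical linear sum of testing overhead (`test_cost` per unit testing rate) and a fixed `misroute_delay` paid when work is sent to a node that failed after its last test. All function and parameter names are hypothetical.

```python
import math
import random


def stale_probability(failure_rate: float, test_rate: float) -> float:
    """Probability that a node seen as operational at the last test has since failed.

    Toy assumptions: exponential lifetime, Poisson testing, no repair.
    The age of the last test is then exponential with rate test_rate, so
    P(failure before now) = failure_rate / (failure_rate + test_rate).
    """
    return failure_rate / (failure_rate + test_rate)


def toy_cost(test_rate: float, failure_rate: float,
             test_cost: float, misroute_delay: float) -> float:
    """Hypothetical cost: testing overhead plus expected misrouting delay."""
    return (test_cost * test_rate
            + misroute_delay * stale_probability(failure_rate, test_rate))


def toy_optimal_test_rate(failure_rate: float, test_cost: float,
                          misroute_delay: float) -> float:
    """Minimizer of toy_cost over test_rate >= 0.

    Setting the derivative to zero gives
    test_cost = misroute_delay * failure_rate / (failure_rate + test_rate)**2,
    hence test_rate = sqrt(misroute_delay * failure_rate / test_cost) - failure_rate.
    """
    root = math.sqrt(misroute_delay * failure_rate / test_cost)
    return max(0.0, root - failure_rate)


def simulate_stale_probability(failure_rate: float, test_rate: float,
                               samples: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo check of stale_probability under the same toy assumptions."""
    rng = random.Random(seed)
    stale = 0
    for _ in range(samples):
        age_of_last_test = rng.expovariate(test_rate)
        time_to_failure = rng.expovariate(failure_rate)
        if time_to_failure < age_of_last_test:
            stale += 1
    return stale / samples
```

In this toy, testing more often lowers the chance of acting on out-of-date results but raises overhead, so the cost has an interior minimum whenever `misroute_delay > test_cost * failure_rate`. The closed form derived in the article is for its own response time model and should not be expected to coincide with this expression.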




Additional information

This research was performed while Satish Tripathi and David Finkel were visiting ISEM. Satish Tripathi's research was supported in part by grants from NSF (grant no. DCR-84-05235) and NASA (grant no. NAG 5-235), and by Université de Paris-Sud.


About this article

Cite this article

Tripathi, S.K., Finkel, D. & Gelenbe, E. Load sharing in distributed systems with failures. Acta Informatica 25, 677–689 (1988). https://doi.org/10.1007/BF00291054
