Abstract.
Clusters and distributed systems offer fault tolerance and high performance through load sharing. When all n computers are up and running, the load should be evenly distributed among them. When one or more computers break down, their load must be redistributed to the remaining computers in the system. This redistribution is determined by the recovery scheme, which is governed by a sequence of integers modulo n. Each sequence guarantees minimal load on the computer that has maximal load, even when the most unfavorable combinations of computers go down. We calculate the best possible such recovery schemes for any number of crashed computers by an exhaustive search, where brute-force testing is avoided by a mathematical reformulation of the problem and a branch-and-bound algorithm. The search nevertheless has high complexity. Optimal sequences, and thus a corresponding optimal bound, are presented for up to twenty-one computers in the distributed system or cluster.
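To make the objective concrete, the following minimal Python sketch evaluates the worst-case maximal load of a given sequence by plain enumeration. It rests on an assumed relocation rule that the abstract does not state: each computer initially runs one process, and the process of a crashed computer i moves to the first running computer among (i + r) mod n, taken in sequence order. The function names `max_load` and `worst_case` are hypothetical, and this is the brute-force testing that the paper avoids, not its branch-and-bound search.

```python
from itertools import combinations

def max_load(n, seq, crashed):
    """Largest number of processes on any surviving computer.

    Assumed model (illustrative, not taken from the paper): every computer
    starts with one process, and the process of crashed computer i moves to
    the first computer (i + r) % n, for r in seq order, that is still up.
    seq is expected to be a permutation of 1..n-1.
    """
    down = set(crashed)
    load = {c: 1 for c in range(n) if c not in down}
    for i in down:
        # First candidate in the recovery sequence that has not crashed.
        target = next((i + r) % n for r in seq if (i + r) % n not in down)
        load[target] += 1
    return max(load.values())

def worst_case(n, seq, k):
    """Maximum of max_load over all sets of k crashed computers (k < n)."""
    return max(max_load(n, seq, c) for c in combinations(range(n), k))

if __name__ == "__main__":
    print(worst_case(4, [1, 2, 3], 2))  # naive sequence
    print(worst_case(4, [1, 3, 2], 2))  # alternative sequence
```

Under this assumed rule, with four computers and two crashes, the sequence 1, 2, 3 can leave one computer with three processes, while 1, 3, 2 never exceeds two. The enumeration grows combinatorially with n, which is why a search over all sequences needs something better than this direct test.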
Additional information
Received: 26 May 2004, Published online: 14 March 2005
About this article
Cite this article
Klonowska, K., Lennerstad, H., Lundberg, L. et al. Optimal recovery schemes in fault tolerant distributed computing. Acta Informatica 41, 341–365 (2005). https://doi.org/10.1007/s00236-005-0161-7
DOI: https://doi.org/10.1007/s00236-005-0161-7