Abstract
Fault-tolerant programs are typically not only difficult to implement but also incur extra costs in terms of performance or resource consumption. Failures are typically relatively rare, but the fault-tolerance overhead must be paid regardless of whether any failures occur during program execution. This paper presents an approach that reduces the cost of fault tolerance, namely, adaptation to a change in the failure model. In particular, a program that assumes no failures (or only benign failures) is combined with a component that is responsible for detecting whether failures occur and then switching to a fault-tolerant algorithm. Provided that the detection and adaptation mechanisms are not too expensive, this approach results in a program with smaller fault-tolerance overhead and thus better performance than a traditional fault-tolerant program. Thus, the high cost of fault tolerance is only paid when failures actually occur.
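The switching idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' mechanism: the class `AdaptiveProgram`, the use of replicated callables, the "missing reply" detector, and the majority vote are all assumptions chosen only to make the pattern concrete. The program starts in a cheap mode that queries a single replica, and moves to a redundant mode only after the detector reports a failure.

```python
from collections import Counter
from typing import Callable, List, Optional

# Hypothetical replica type: returns an answer, or None on a benign
# (crash/omission) failure. This failure signal is an assumption.
Replica = Callable[[int], Optional[int]]


class AdaptiveProgram:
    """Illustrative only: cheap algorithm plus a detector that triggers
    a one-way switch to a more expensive fault-tolerant algorithm."""

    def __init__(self, replicas: List[Replica]) -> None:
        self.replicas = replicas
        self.mode = "optimistic"  # assume no failures until one is detected

    def _optimistic(self, x: int) -> Optional[int]:
        # Cheap path: ask a single replica, no redundancy.
        return self.replicas[0](x)

    def _failure_detected(self, result: Optional[int]) -> bool:
        # Assumed detector: a missing reply indicates a failure.
        return result is None

    def _fault_tolerant(self, x: int) -> int:
        # Expensive path: ask every replica and take the most common answer.
        answers = [r(x) for r in self.replicas]
        valid = [a for a in answers if a is not None]
        if not valid:
            raise RuntimeError("all replicas failed")
        return Counter(valid).most_common(1)[0][0]

    def run(self, x: int) -> int:
        if self.mode == "optimistic":
            result = self._optimistic(x)
            if not self._failure_detected(result):
                return result
            # Adaptation step: the failure model changed, so switch algorithms.
            self.mode = "fault_tolerant"
        return self._fault_tolerant(x)
```

In this sketch the redundant queries are issued only after the first detected failure, so a failure-free run pays for one replica call per request plus the detector check.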
This work was supported in part by the National Science Foundation under grant CCR-9633336 and the Defense Advanced Research Projects Agency under grants F30602-96-1-0342 and N6600197-C-8518.
Copyright information
© 1998 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chang, I., Hiltunen, M.A., Schlichting, R.D. (1998). Affordable fault tolerance through adaptation. In: Rolim, J. (eds) Parallel and Distributed Processing. IPPS 1998. Lecture Notes in Computer Science, vol 1388. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-64359-1_730
DOI: https://doi.org/10.1007/3-540-64359-1_730
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-64359-3
Online ISBN: 978-3-540-69756-5
eBook Packages: Springer Book Archive