Deterministic fault injection of distributed systems

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 938)

Abstract

Ensuring that a system meets its prescribed specification is a growing challenge for software developers and system engineers. Meeting this challenge is particularly important for distributed systems with strict dependability and timeliness constraints. This paper presents a technique, called script-driven probing and fault injection, for the evaluation and validation of dependable protocols. The approach can be used to address three aspects of a target protocol: (i) detection of design or implementation errors, (ii) identification of violations of protocol specifications, and (iii) insight into design decisions made by the implementors. To demonstrate the capabilities of this technique, the paper briefly describes a probing and fault injection tool, called the PFI tool, and several experiments on two protocols: the Transmission Control Protocol (TCP) [4, 24] and the Group Membership Protocol (GMP) [19]. The tool can delay, drop, reorder, duplicate, and modify messages. It can also introduce new messages into the system to probe participants. In the case of TCP, we used the PFI tool to quickly reproduce the experiments reported in [7] on several TCP implementations, without access to the vendors' TCP source code. We also ran several new experiments that are difficult to perform with earlier approaches based on packet monitoring and filtering. In the case of GMP, we used the tool to test the fault-tolerance capabilities of an implementation under various failure models, including daemon/link crashes, send/receive omissions, and timing failures. Furthermore, by selectively reordering messages and spontaneously transmitting new messages, we were able to guide a distributed computation into hard-to-reach global states without instrumenting the protocol implementation.
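The message-level faults listed in the abstract (delay, drop, reorder, duplicate, modify, and injection of new messages) can be pictured as a filter on a protocol's send path that consults a script for every outgoing message. The Python sketch below is a minimal illustration under stated assumptions, not the PFI tool's actual interface: the names `Message`, `FaultInjectionLayer`, `demo_script`, and the action tuples (`"pass"`, `"drop"`, `"delay"`, `"dup"`, `"hold"`, `"modify"`) are hypothetical, and a logical clock stands in for real time so that every run of the same script yields the same delivery order.

```python
import heapq
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Message:
    src: str
    dst: str
    kind: str
    seq: int


# Hypothetical action vocabulary returned by a filter script:
#   ("pass",) ("drop",) ("delay", ticks) ("dup", copies) ("hold",) ("modify", fn)
Script = Callable[[Message], tuple]


class FaultInjectionLayer:
    """Sketch of a script-driven filter placed below a protocol's send path."""

    def __init__(self, script: Script) -> None:
        self.script = script
        self.now = 0                                      # logical clock, no wall time
        self._queue: List[Tuple[int, int, Message]] = []  # (due tick, tiebreak, msg)
        self._order = 0                                   # FIFO tiebreak for equal due ticks
        self.held: List[Message] = []                     # messages buffered for reordering
        self.delivered: List[Message] = []

    def _schedule(self, msg: Message, delay: int = 0) -> None:
        heapq.heappush(self._queue, (self.now + delay, self._order, msg))
        self._order += 1

    def send(self, msg: Message) -> None:
        """Apply the script's action to an outgoing message."""
        action = self.script(msg)
        op = action[0]
        if op == "pass":
            self._schedule(msg)
        elif op == "drop":
            pass                                          # omission: message vanishes
        elif op == "delay":
            self._schedule(msg, action[1])                # timing fault
        elif op == "dup":
            for _ in range(action[1]):
                self._schedule(msg)
        elif op == "hold":
            self.held.append(msg)
        elif op == "modify":
            self._schedule(action[1](msg))
        else:
            raise ValueError(f"unknown action {op!r}")

    def inject(self, msg: Message) -> None:
        """Probe: introduce a new message without consulting the script."""
        self._schedule(msg)

    def release_held(self, reverse: bool = True) -> None:
        """Reorder: release held messages, newest first when reverse is True."""
        batch = list(reversed(self.held)) if reverse else list(self.held)
        self.held.clear()
        for msg in batch:
            self._schedule(msg)

    def tick(self, ticks: int = 1) -> None:
        """Advance the logical clock and deliver every message that is due."""
        self.now += ticks
        while self._queue and self._queue[0][0] <= self.now:
            _, _, msg = heapq.heappop(self._queue)
            self.delivered.append(msg)


def demo_script(msg: Message) -> tuple:
    """Hypothetical test scenario for a TCP-like exchange."""
    if msg.kind == "ACK" and msg.seq == 2:
        return ("drop",)
    if msg.kind == "DATA" and msg.seq == 1:
        return ("delay", 3)
    if msg.kind == "DATA" and msg.seq == 3:
        return ("dup", 2)
    if msg.kind == "DATA" and msg.seq in (4, 5):
        return ("hold",)
    if msg.kind == "FIN":
        return ("modify", lambda m: replace(m, kind="RST"))
    return ("pass",)


if __name__ == "__main__":
    layer = FaultInjectionLayer(demo_script)
    for s in range(1, 6):
        layer.send(Message("A", "B", "DATA", s))
    layer.send(Message("B", "A", "ACK", 2))
    layer.release_held()
    layer.tick()
    print([m.seq for m in layer.delivered])   # [2, 3, 3, 5, 4]
    layer.tick(2)
    print([m.seq for m in layer.delivered])   # [2, 3, 3, 5, 4, 1]
```

In this run, ACK 2 is dropped, DATA 3 arrives twice, the held DATA 4 and DATA 5 are released in reverse order, and the delayed DATA 1 arrives only once the clock reaches tick 3. Keeping the script a pure function of each message is what makes the sketch repeatable; expressing patterns such as "drop every third ACK" would additionally require state inside the script.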

This work is supported in part by a research grant from the U.S. Office of Naval Research.

References

  1. J. Arlat, Y. Crouzet, and J.-C. Laprie. Fault injection for dependability validation of fault-tolerant computing systems. In Proc. Int'l Symp. on Fault-Tolerant Computing, pages 348–355, June 1989.

  2. Jean Arlat, Martine Aguera, Yves Crouzet, Jean-Charles Fabre, Eliane Martins, and David Powell. Experimental evaluation of the fault tolerance of an atomic multicast system. IEEE Trans. Reliability, 39(4):455–467, October 1990.

  3. D. Avresky, J. Arlat, J.C. Laprie, and Yves Crouzet. Fault injection for the formal testing of fault tolerance. In Proc. Int'l Symp. on Fault-Tolerant Computing, pages 345–354. IEEE, 1992.

  4. R. Braden. RFC-1122: Requirements for internet hosts. Request for Comments, October 1989. Network Information Center.

  5. R. Chillarege and N. S. Bowen. Understanding large system failures — a fault injection experiment. In Proc. Int'l Symp. on Fault-Tolerant Computing, pages 356–363, June 1989.

  6. G. Choi, R. Iyer, and V. Carreno. Simulated fault injection: A methodology to evaluate fault tolerant microprocessor architectures. IEEE Trans. Reliability, 39(4):486–490, October 1990.

  7. Douglas E. Comer and John C. Lin. Probing TCP implementations. In Proc. Summer USENIX Conference, June 1994.

  8. F. Cristian. Reaching agreement on processor-group membership in synchronous distributed systems. Distributed Computing, (4):175–187, 1991.

  9. E. Czeck and D. Siewiorek. Effects of transient gate-level faults on program behaviour. In Proc. Int'l Symp. on Fault-Tolerant Computing, pages 236–243. IEEE, 1990.

  10. Scott Dawson and Farnam Jahanian. Probing and fault injection of protocol implementations. Technical Report CSE-TR-217-94, The University of Michigan, October 1994.

  11. K. Echtle and Y. Chen. Evaluation of deterministic fault injection for fault-tolerant protocol testing. In Proc. Int'l Symp. on Fault-Tolerant Computing, pages 418–425. IEEE, 1991.

  12. Klaus Echtle and Martin Leu. The EFA fault injector for fault-tolerant distributed system testing. In Workshop on Fault-Tolerant Parallel and Distributed Systems, pages 28–35. IEEE, 1992.

  13. G. Finelli. Characterization of fault recovery through fault injection on FTMP. IEEE Trans. Reliability, 36(2):164–170, June 1987.

  14. K. Goswami and R. Iyer. Simulation of software behaviour under hardware faults. In Proc. Int'l Symp. on Fault-Tolerant Computing, pages 218–227. IEEE, 1993.

  15. Vassos Hadzilacos and Sam Toueg. Fault-tolerant broadcasts and related problems. In Sape Mullender, editor, Distributed Systems. Addison Wesley, 1993. Second Edition.

  16. Seungjae Han, Harold A. Rosenberg, and Kang G. Shin. DOCTOR: An IntegrateD sOftware fault injeCTOn enviRonment. Technical Report CSE-TR-192-93, The University of Michigan, December 1993.

  17. Norman C. Hutchinson and Larry L. Peterson. The x-Kernel: An architecture for implementing network protocols. IEEE Trans. Software Engineering, 17(1):1–13, January 1991.

  18. David B. Ingham and Graham D. Parrington. Delayline: A wide-area network emulation tool. Computing Systems, 7(3):313–332, Summer 1994.

  19. Farnam Jahanian, Ragunathan Rajkumar, and Sameh Fakhouri. Processor group membership protocols: Specification, design and implementation. In Proceedings of the 12th Symposium on Reliable Distributed Systems, pages 2–11, Princeton, New Jersey, October 1993.

  20. G. A. Kanawati, N. A. Kanawati, and J. A. Abraham. FERRARI: A tool for the validation of system dependability properties. In Proc. Int'l Symp. on Fault-Tolerant Computing, pages 336–344. IEEE, 1992.

  21. Steven McCanne and Van Jacobson. The BSD packet filter: A new architecture for user-level packet capture. In Winter USENIX Conference, pages 259–269, January 1993.

  22. Shivakant Mishra, Larry L. Peterson, and Richard D. Schlichting. A membership protocol based on partial order. In Second Working Conference on Dependable Computing for Critical Applications, February 1990.

  23. J. Mogul, R. Rashid, and M. Accetta. The packet filter: An efficient mechanism for user-level network code. In Proc. ACM Symp. on Operating Systems Principles, pages 39–51, Austin, TX, November 1987. ACM.

  24. Jon Postel. RFC-793: Transmission control protocol. Request for Comments, September 1981. Network Information Center.

  25. A. M. Ricciardi and K. P. Birman. Using process groups to implement failure detection in asynchronous environments. In Proceedings of the 11th ACM Symposium on Principles of Distributed Computing, Montreal, Quebec, August 1991.

  26. Z. Segall et al. FIAT: Fault injection based automated testing environment. In FTCS-18, pages 102–107, 1988.

  27. K. G. Shin and Y. H. Lee. Measurement and application of fault latency. IEEE Trans. Computers, C-35(4):370–375, April 1986.

  28. Masanobu Yuhara, Brian N. Bershad, Chris Maeda, and J. Eliot B. Moss. Efficient packet demultiplexing for multiple endpoints and large messages. In Winter USENIX Conference, January 1994.

Editor information

Kenneth P. Birman, Friedemann Mattern, André Schiper

Copyright information

© 1995 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Dawson, S., Jahanian, F. (1995). Deterministic fault injection of distributed systems. In: Birman, K.P., Mattern, F., Schiper, A. (eds) Theory and Practice in Distributed Systems. Lecture Notes in Computer Science, vol 938. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-60042-6_13

  • DOI: https://doi.org/10.1007/3-540-60042-6_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-60042-8

  • Online ISBN: 978-3-540-49409-6

  • eBook Packages: Springer Book Archive
