Using Data Flow Information to Obtain Efficient Check Sets for Algorithm-Based Fault Tolerance

International Journal of Parallel Programming

Abstract

Algorithm-Based Fault Tolerance (ABFT) is a well-known technique for achieving fault and error detection in multiprocessor systems. We examine several issues concerning ABFT systems when the data flow information for the underlying multiprocessor computation is available. Our results show that this finer-grained information can be exploited to obtain test schemes involving fewer checks, and in some cases dramatically fewer. We address both the analysis and design of ABFT systems when the data flow information is available. The analysis problem for a given ABFT system is to determine the fault detectability and the fault locatability of the system (the maximum numbers of faulty processors that can be detected and located, respectively). We show that the analysis problem can be solved efficiently when the number of faults is fixed. We also address the computational difficulty of this problem when the number of faults is not fixed. The design problem is concerned with the construction of a minimal collection of checks which can detect or locate a specified number of faults for a given multiprocessor computation. We examine some special classes of data flow graphs and establish upper and lower bounds on the number of checks needed to detect or locate a given number of faults. We also address the computational difficulty of this design problem for several cases.
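
To make the analysis problem concrete, the following is a minimal sketch under a deliberately simplified model that is an assumption of this example, not the model defined in the article: each check is represented only by the set of processors it covers, and a check is assumed to fire exactly when it covers at least one and at most `g` faulty processors. Data flow information is not modeled at all. The function names and the bound `g` are hypothetical. The sketch only illustrates why brute-force enumeration of fault sets is polynomial in the number of processors when the number of faults `t` is fixed.

```python
from itertools import combinations


def check_fires(check, faulty, g):
    """Assumed model: a check fires iff it covers between 1 and g faulty processors."""
    hits = len(check & faulty)
    return 1 <= hits <= g


def is_t_fault_detecting(processors, checks, t, g=1):
    """Return True if every nonempty fault set of size <= t makes some check fire.

    For fixed t this enumerates O(n^t) fault sets, where n is the number of
    processors, so the running time is polynomial in n.
    """
    processors = list(processors)
    for size in range(1, t + 1):
        for fault_set in combinations(processors, size):
            faulty = frozenset(fault_set)
            if not any(check_fires(c, faulty, g) for c in checks):
                return False
    return True


def fault_detectability(processors, checks, g=1):
    """Largest t for which the system is t-fault detecting (0 if there is none).

    Here t is not fixed, so this loop can take exponential time in general.
    """
    processors = list(processors)
    t = 0
    while t < len(processors) and is_t_fault_detecting(processors, checks, t + 1, g):
        t += 1
    return t
```

With four processors `0..3` and the two checks `{0, 1}` and `{2, 3}`, every single fault is caught under this assumed model, but the fault pair `{0, 1}` pushes the first check past its bound and escapes the second, so the detectability is 1. Adding the checks `{0, 2}` and `{1, 3}` raises it to 3.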

Cite this article

Narasimhan, R., Rosenkrantz, D.J. & Ravi, S.S. Using Data Flow Information to Obtain Efficient Check Sets for Algorithm-Based Fault Tolerance. International Journal of Parallel Programming 27, 289–323 (1999). https://doi.org/10.1023/A:1018793714426

  • DOI: https://doi.org/10.1023/A:1018793714426