Abstract
Algorithm-Based Fault Tolerance (ABFT) is a well-known technique for fault and error detection in multiprocessor systems. We examine several issues concerning ABFT systems when data flow information for the underlying multiprocessor computation is available. Our results show that this finer-grained information can be exploited to obtain test schemes involving fewer checks and, in some cases, dramatically fewer. We address both the analysis and the design of ABFT systems in this setting. The analysis problem for a given ABFT system is to determine its fault detectability and fault locatability, that is, the maximum number of faulty processors that can be detected and located, respectively. We show that the analysis problem can be solved efficiently when the number of faults is fixed, and we address its computational difficulty when the number of faults is not fixed. The design problem is to construct a minimal collection of checks that can detect or locate a specified number of faults for a given multiprocessor computation. We examine some special classes of data flow graphs and establish upper and lower bounds on the number of checks needed to detect or locate a given number of faults. We also address the computational difficulty of the design problem for several cases.
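For readers unfamiliar with what a "check" is in ABFT, the following is a minimal sketch of the classic row and column checksum scheme for matrix multiplication introduced by Huang and Abraham. It is offered only as background illustration and is not the data-flow-based construction studied in this article. All function names are hypothetical, and integer matrices are assumed so that exact equality can be used in place of a roundoff tolerance.

```python
def matmul(a, b):
    """Plain triple-loop matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def add_column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def add_row_checksum(b):
    """Append one extra column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]

def failed_checks(c_full):
    """Return (bad_rows, bad_cols): indices whose checksum does not match.

    c_full is a full-checksum matrix: its last column holds row sums
    and its last row holds column sums of the data part.
    """
    n = len(c_full) - 1       # number of data rows
    m = len(c_full[0]) - 1    # number of data columns
    bad_rows = [i for i in range(n)
                if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    return bad_rows, bad_cols

# Multiply a column-checksum matrix by a row-checksum matrix; the
# product is a full-checksum matrix when no error has occurred.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul(add_column_checksum(A), add_row_checksum(B))
clean = failed_checks(C)

# Simulate a faulty processor corrupting one data element.
C[1][0] += 1
corrupted = failed_checks(C)
```

In this sketch, a single corrupted element fails exactly one row check and one column check, so their intersection locates the error. The article's question is how few such checks are needed once the data flow of the computation is known.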
Cite this article
Narasimhan, R., Rosenkrantz, D.J. & Ravi, S.S. Using Data Flow Information to Obtain Efficient Check Sets for Algorithm-Based Fault Tolerance. International Journal of Parallel Programming 27, 289–323 (1999). https://doi.org/10.1023/A:1018793714426
DOI: https://doi.org/10.1023/A:1018793714426