Using Data Flow Information to Obtain Efficient Check Sets for Algorithm-Based Fault Tolerance

Narasimhan, Ragini; Rosenkrantz, Daniel J.; Ravi, S. S.

doi:10.1023/A:1018793714426


Published: August 1999

Volume 27, pages 289–323, (1999)

International Journal of Parallel Programming


Abstract

Algorithm-Based Fault Tolerance (ABFT) is a well-known technique for achieving fault and error detection in multiprocessor systems. We examine several issues concerning ABFT systems when the data flow information for the underlying multiprocessor computation is available. Our results show that this finer-grained information can be exploited to obtain test schemes involving fewer checks and, in some cases, dramatically fewer checks. We address both the analysis and design of ABFT systems when the data flow information is available. The analysis problem for a given ABFT system is to determine the fault detectability and the fault locatability (the maximum number of detectable and locatable faulty processors) of the system. We show that the analysis problem can be solved efficiently when the number of faults is fixed. We also address the computational difficulty of this problem when the number of faults is not fixed. The design problem is concerned with the construction of a minimal collection of checks that can detect or locate a specified number of faults for a given multiprocessor computation. We examine some special classes of data flow graphs and establish upper and lower bounds on the number of checks needed to detect or locate a given number of faults. We also address the computational difficulty of this design problem for several cases.
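
To make the analysis problem concrete, the following is a minimal sketch under a deliberately simplified model, not the model or algorithm of the article. It assumes each check is just a set of processors, and that a check reliably flags an error only when it covers between 1 and `h` faulty processors (the bound `h`, the function names, and the example check sets are all hypothetical). Under that assumption, the fault detectability can be found by enumerating fault sets of each size; for a fixed number of faults `t` this enumeration is polynomial in the number of processors, which is one way to see why a fixed fault count can make the analysis tractable.

```python
from itertools import combinations


def is_detected(faulty, checks, h):
    """Assumed model: a fault set is detected if some check covers
    at least 1 and at most h of the faulty processors."""
    return any(1 <= len(check & faulty) <= h for check in checks)


def all_sets_of_size_detected(processors, checks, k, h):
    """Brute force: test every fault set with exactly k faulty processors."""
    return all(
        is_detected(frozenset(faulty), checks, h)
        for faulty in combinations(processors, k)
    )


def fault_detectability(processors, checks, h):
    """Largest t such that every nonempty fault set of size <= t is detected."""
    processors = list(processors)
    t = 0
    for k in range(1, len(processors) + 1):
        if not all_sets_of_size_detected(processors, checks, k, h):
            break
        t = k
    return t


# Hypothetical example: four processors, three overlapping checks, h = 1.
chain_checks = [frozenset({0, 1}), frozenset({1, 2}), frozenset({2, 3})]
# A single check over all four processors, for comparison.
single_check = [frozenset({0, 1, 2, 3})]

chain_t = fault_detectability(range(4), chain_checks, h=1)
single_t = fault_detectability(range(4), single_check, h=1)
print(chain_t, single_t)
```

In this toy setting the three overlapping checks tolerate more simultaneous faults than the single wide check, because a wide check is more easily pushed past its bound `h`. The article's actual model works with data flow information rather than processor sets alone, so this sketch only illustrates the shape of the question.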




About this article

Cite this article

Narasimhan, R., Rosenkrantz, D.J. & Ravi, S.S. Using Data Flow Information to Obtain Efficient Check Sets for Algorithm-Based Fault Tolerance. International Journal of Parallel Programming 27, 289–323 (1999). https://doi.org/10.1023/A:1018793714426
