Abstract
The primary focus in the analysis of massively parallel supercomputers has traditionally been on their performance. However, their complex network topologies, large number of processors, and sophisticated system software can make them very unreliable. If every failure of one of the many components of a massively parallel computer could shut down the machine, the machine would be useless. Therefore fault tolerance is required. The basis of effective mechanisms for fault tolerance is an efficient diagnosis.
This paper deals with concurrent and hierarchical system level diagnosis for a particular massively parallel architecture and with a simulation-based method to validate the proposed diagnosis algorithm. The diagnosis algorithm is presented and we describe a simulation-based method to test and verify the algorithms for fault tolerance already during the design phase of the target machine.
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Bianchini R., Buskens R. Implementation of On-Line Distributed System-Level Diagnosis Theory IEEE Transaction on computer. vol. C-41, No. 5, pp 616–626, May 1992
Bieker B., Deconinck G., Maehle E., Vouncks J. Reconfiguration and Checkpointing in Massively Parallel Systems Submitted to EDCC-1 1994
Bobbio A. Dependability Analysis of Fault-Tolerant Systems: a Literature Survey in Microprocessing and Microprogramming 29 (1990), pp 1–13, North-Holland, 1990.
Dal Cin M., Hofmann F., Grygier A., Hessenauer H., Hildebrand U., Linster C.U., Thiel T., Turowski S. MEMSY — A Modular Expandable Multiprocessor System in A. Bode, M. Dal Cin (eds), Parallel Computer Architectures, pp 15–30, Springer LNCS 732, 1993.
Goswami, Kumar K., Ravi K. Iyer. The DEPEND Reference Manual. Center for Reliable and High-Performance Computing, University of Illinois at Urbana-Champaign, 1991.
Goswami, Kumar K. Design for Dependability: A Simulation-Based Approach. Ph.D. Thesis, Center for Reliable and High-Performance Computing, University of Illinois at Urbana-Champaign, 1993.
Grand Challenges High Performance Computing and Communication. The Fiscal Year 1992 U.S. Research and Development Program. Report by the Committee on Physical, Mathematical, and Engeneering Sciences, NFS Washington 1992.
Hein, Axel. SimParGC — Ein Simulator zur Leistungs-und Zuverlässigkeits-Analyse des Multiprozessorsystems Parsytec GC, Version 1.0. Internal Report, IMMD 3, University of Erlangen-Nürnberg, 1994.
Hosseini S, Kuhl J.G., Reddy S.M. A Diagnosis Algorithm for Distributed Computing Systems with Dynamic Failure and Repair IEEE Transaction on computer. vol. C-33, pp 223–233. Mar. 1984
Inmos The T9000 Transputer Hardware Reference Manual INMOS Limited 1993.
Kuhl, J.G; Reddy, S.M. Distibuted fault tolerance for large multiprocessor systems ACM-Sigarch Newsletter 8, No.3, pp23–30, 1980
Kuhl, J.; Reddy, S. Fault-diagnosis in fully distributed systems FTCS 11, Fault tolerant computing: the 11th international symposium, pp. 100–105, 1981
Marsan, M. Ajmone, G. Balbo und G. Conte. Performance Models of Multiprocessor Systems. Cambridge; London: The MIT Press. 1986.
Meyer, F.J.; Masson, G. An efficient fault diagnosis algorithm fot symetric multiprocessor architecture IEEE Transaction on computer, vol. C-27, pp. 1059–1063, Nov. 1978
Parsytec Computer GmbH. The Parsytec GC Technical Summary, Version 1.0. Aachen (Germany), 1991.
Parsytec Computer GmbH. PARIX Release 1.2. Reference Manual. Aachen (Germany), 1993.
Preparata, F.P.; Metze, G.; Chien, R.T On the connection assignment problem of diagnosable systems IEEE Trans.Electronic Computing. Vol. EC-16, pp 848–854, December 1967
Stahl, M.; Buskens, R.; Bianchini, R. Jr. On-line diagnosis in general topology networks Workshop on fault tolerant Parallel and Distributed Systems, pp. 114–121 IEEE Computer Society, Massachusetts July 1992
Stroustrup, Bjarne. The C++ Programming Language, Second Edition. New York; London [u.a.]: Addison-Wesley Publishing Company, 1991.
Trivedi, Kishor S. Probability and Statistics with Reliability, Queuing, and Computer Science Applications. Englewood Cliffs: NJ Prentice Hall, 1982.
Author information
Authors and Affiliations
Editor information
Rights and permissions
Copyright information
© 1994 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Altmann, J., Balbach, F., Hein, A. (1994). An approach for hierarchical system level diagnosis of massively parallel computers combined with a simulation-based method for dependability analysis. In: Echtle, K., Hammer, D., Powell, D. (eds) Dependable Computing — EDCC-1. EDCC 1994. Lecture Notes in Computer Science, vol 852. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-58426-9_142
Download citation
DOI: https://doi.org/10.1007/3-540-58426-9_142
Published:
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-58426-1
Online ISBN: 978-3-540-48785-2
eBook Packages: Springer Book Archive