An approach for hierarchical system level diagnosis of massively parallel computers combined with a simulation-based method for dependability analysis

Altmann, J.; Balbach, F.; Hein, A.

doi:10.1007/3-540-58426-9_142

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 852)

Included in the following conference series:

European Dependable Computing Conference

Abstract

The analysis of massively parallel supercomputers has traditionally focused on their performance. However, their complex network topologies, large numbers of processors, and sophisticated system software can make them very unreliable. If the failure of any one of the many components of a massively parallel computer could shut down the machine, the machine would be useless. Fault tolerance is therefore required, and effective fault-tolerance mechanisms rest on efficient diagnosis.

This paper deals with concurrent and hierarchical system level diagnosis for a particular massively parallel architecture and with a simulation-based method to validate the proposed diagnosis algorithm. We present the diagnosis algorithm and describe a simulation-based method for testing and verifying fault-tolerance algorithms as early as the design phase of the target machine.
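
The abstract's remark that a machine would be useless if any single component failure could shut it down can be made concrete with a small calculation. The sketch below is not taken from the paper: it assumes the textbook series-system model, in which components fail independently and the machine works only if every component works, and the function name and the example figures (0.999 per-component reliability, 1024 components) are purely illustrative.

```python
def series_reliability(component_reliability: float, n_components: int) -> float:
    """Probability that a machine with no fault tolerance is operational.

    Assumption (not from the paper): the machine works only if all
    n_components work, and components fail independently, each being
    operational with probability component_reliability.
    """
    if not 0.0 <= component_reliability <= 1.0:
        raise ValueError("component_reliability must lie in [0, 1]")
    if n_components < 0:
        raise ValueError("n_components must be non-negative")
    # Independent events: multiply the per-component probabilities.
    return component_reliability ** n_components


if __name__ == "__main__":
    # Hypothetical figures chosen only to show the trend.
    for n in (1, 64, 1024):
        print(n, round(series_reliability(0.999, n), 3))
```

Under these assumptions, 1024 components that are each 99.9% reliable give a machine that is operational only about 36% of the time, which illustrates why diagnosis and fault tolerance become essential as the component count grows.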

Author information

Authors and Affiliations

Institut für Mathematische Maschinen und Datenverarbeitung (IMMD) III, Universität Erlangen-Nürnberg, Martensstr. 3, 91058, Erlangen, Germany
J. Altmann, F. Balbach & A. Hein

Editor information

Klaus Echtle, Dieter Hammer, David Powell

Cite this paper

Altmann, J., Balbach, F., Hein, A. (1994). An approach for hierarchical system level diagnosis of massively parallel computers combined with a simulation-based method for dependability analysis. In: Echtle, K., Hammer, D., Powell, D. (eds) Dependable Computing — EDCC-1. EDCC 1994. Lecture Notes in Computer Science, vol 852. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-58426-9_142

DOI: https://doi.org/10.1007/3-540-58426-9_142
Published: 07 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-58426-1
Online ISBN: 978-3-540-48785-2
eBook Packages: Springer Book Archive
