An approach for hierarchical system level diagnosis of massively parallel computers combined with a simulation-based method for dependability analysis

  • Session 9: Parallel systems
  • Conference paper
Dependable Computing — EDCC-1 (EDCC 1994)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 852)

Abstract

The analysis of massively parallel supercomputers has traditionally focused on their performance. However, their complex network topologies, large numbers of processors, and sophisticated system software can make them very unreliable. If every failure of one of the many components of a massively parallel computer could shut down the machine, the machine would be useless. Fault tolerance is therefore required, and effective fault-tolerance mechanisms depend on efficient diagnosis.

This paper deals with concurrent, hierarchical system-level diagnosis for a particular massively parallel architecture and with a simulation-based method for validating the proposed diagnosis algorithm. We present the diagnosis algorithm and describe a simulation-based method for testing and verifying fault-tolerance algorithms as early as the design phase of the target machine.

Editor information

Klaus Echtle, Dieter Hammer, David Powell

Copyright information

© 1994 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Altmann, J., Balbach, F., Hein, A. (1994). An approach for hierarchical system level diagnosis of massively parallel computers combined with a simulation-based method for dependability analysis. In: Echtle, K., Hammer, D., Powell, D. (eds) Dependable Computing — EDCC-1. EDCC 1994. Lecture Notes in Computer Science, vol 852. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-58426-9_142

  • DOI: https://doi.org/10.1007/3-540-58426-9_142

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-58426-1

  • Online ISBN: 978-3-540-48785-2

  • eBook Packages: Springer Book Archive
