
The FTMPS-project: Design and implementation of fault-tolerance techniques for massively parallel systems

  • Monitoring, Debugging, and Fault Tolerance
  • Conference paper
High-Performance Computing and Networking (HPCN-Europe 1994)

Abstract

The FTMPS-project addresses the need for fault tolerance in large systems. A complete fault-tolerance approach has been developed and is being implemented. The built-in hardware error-detection features, combined with software error-detection techniques, provide high coverage of transient as well as permanent failures. Together with the diagnosis software, they collect the information needed by the OSS (statistics and visualisation) and for possible reconfiguration. Backward error recovery, based on checkpointing and rollback, is implemented.
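The abstract does not describe how checkpointing and rollback are realised in FTMPS, so the following is only a minimal single-process sketch of the general idea of backward error recovery. All names (`CheckpointedTask`, `run`, `fail_at`, `checkpoint_every`) are hypothetical and chosen for illustration; the injected fault stands in for an error reported by a detection mechanism.

```python
import copy


class CheckpointedTask:
    """Hypothetical task whose state can be saved and restored."""

    def __init__(self):
        self.state = {"step": 0, "total": 0}
        # The initial state serves as the first checkpoint.
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self):
        # Save a private copy of the current state.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Discard the current state and return to the last checkpoint.
        self.state = copy.deepcopy(self._checkpoint)


def run(task, values, fail_at=None, checkpoint_every=2):
    """Sum `values`, recovering from one injected transient fault.

    Returns (total, number_of_rollbacks).
    """
    rollbacks = 0
    while task.state["step"] < len(values):
        step = task.state["step"]
        try:
            if step == fail_at:
                fail_at = None               # transient: the fault occurs once
                task.state["total"] = -1     # simulate corrupted state
                raise RuntimeError("error detected")
            task.state["total"] += values[step]
            task.state["step"] = step + 1
            if task.state["step"] % checkpoint_every == 0:
                task.checkpoint()
        except RuntimeError:
            # Backward error recovery: restore the saved state and re-execute.
            task.rollback()
            rollbacks += 1
    return task.state["total"], rollbacks
```

With a fault injected at step 3, the task rolls back to the checkpoint taken after step 2, repeats the lost work, and still produces the correct sum. In a massively parallel system, the checkpoints of different nodes would also have to be mutually consistent, which this sketch does not address.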

Supported by the EC as ESPRIT-project 6731




Editor information

Wolfgang Gentzsch Uwe Harms


Copyright information

© 1994 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Vounckx, J. et al. (1994). The FTMPS-project: Design and implementation of fault-tolerance techniques for massively parallel systems. In: Gentzsch, W., Harms, U. (eds) High-Performance Computing and Networking. HPCN-Europe 1994. Lecture Notes in Computer Science, vol 797. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-57981-8_151


  • DOI: https://doi.org/10.1007/3-540-57981-8_151


  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-57981-6

  • Online ISBN: 978-3-540-48408-0

  • eBook Packages: Springer Book Archive
