The FTMPS-project: Design and implementation of fault-tolerance techniques for massively parallel systems

Vounckx, Johan; Deconinck, G.; Lauwereins, R.; Viehöver, G.; Wagner, R.; Madeira, H.; Silva, J. G.; Balbach, F.; Altmann, J.; Bieker, B.; Willeke, H.

doi:10.1007/3-540-57981-8_151

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 797)

Included in the following conference series:

International Conference on High-Performance Computing and Networking

Abstract

The FTMPS project addresses the need for fault tolerance in large systems. A complete fault-tolerance approach has been developed and is being implemented. Built-in hardware error-detection features, combined with software error-detection techniques, provide high coverage of transient as well as permanent failures. Together with the diagnosis software, they collect the information needed for the OSS (statistics and visualisation) and for possible reconfiguration. Backward error recovery based on checkpointing and rollback is implemented.

Supported by the EC as ESPRIT-project 6731
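
The backward error recovery named in the abstract can be illustrated with a minimal sketch. This is not the FTMPS implementation, which the abstract does not describe in detail: the class `CheckpointedTask`, the function `run_with_recovery`, and the use of `RuntimeError` as a stand-in for a detected error are all hypothetical, and the state is held in a single process rather than across a parallel machine.

```python
import copy


class CheckpointedTask:
    """Hypothetical helper: holds a task's state plus one saved copy of it."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        # Save a known-good copy of the current state.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Discard the current state and restore the last saved copy.
        self.state = copy.deepcopy(self._checkpoint)


def run_with_recovery(task, steps, max_retries=3):
    """Run each step; when an error is detected, roll back and retry the step."""
    for step in steps:
        for _attempt in range(max_retries + 1):
            try:
                step(task.state)      # may modify the state before failing
                task.checkpoint()     # step completed: save the new state
                break
            except RuntimeError:      # stand-in for a detected error (assumption)
                task.rollback()       # undo any partial modification
        else:
            # Every attempt failed: treat the fault as permanent.
            raise RuntimeError("step failed after all retries")
    return task.state
```

In this sketch a transient fault is masked by re-executing the step from the last checkpoint, while a fault that persists across all retries is reported upward, which is where diagnosis and reconfiguration would take over in a full system.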

Author information

Authors and Affiliations

Katholieke Universiteit Leuven, ESAT, K. Mercierlaan 94, 3001, Heverlee, Belgium
Johan Vounckx, G. Deconinck & R. Lauwereins (Senior Research Associate of the Belgian National Science Foundation)
Parsytec GmbH (D), Deutschland
G. Viehöver & R. Wagner
Universidade de Coimbra (P), Portugal
H. Madeira & J. G. Silva
F.A. Universität Erlangen-Nürnberg (D), Deutschland
F. Balbach & J. Altmann
Universität-GH Paderborn (D), Deutschland
B. Bieker & H. Willeke

Editor information

Wolfgang Gentzsch, Uwe Harms

About this paper

Cite this paper

Vounckx, J. et al. (1994). The FTMPS-project: Design and implementation of fault-tolerance techniques for massively parallel systems. In: Gentzsch, W., Harms, U. (eds) High-Performance Computing and Networking. HPCN-Europe 1994. Lecture Notes in Computer Science, vol 797. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-57981-8_151

DOI: https://doi.org/10.1007/3-540-57981-8_151
Published: 26 May 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-57981-6
Online ISBN: 978-3-540-48408-0
eBook Packages: Springer Book Archive
