Abstract
The FTMPS-project addresses the need for fault tolerance in large systems. A complete fault-tolerance approach has been developed and is being implemented. Built-in hardware error-detection features, combined with software error-detection techniques, provide high coverage of transient as well as permanent failures. Together with the diagnosis software, these mechanisms collect the information needed for the OSS (statistics and visualisation) and for possible reconfiguration. Backward error recovery based on checkpointing and rollback is implemented.
Supported by the EC as ESPRIT-project 6731
© 1994 Springer-Verlag Berlin Heidelberg
Vounckx, J. et al. (1994). The FTMPS-project: Design and implementation of fault-tolerance techniques for massively parallel systems. In: Gentzsch, W., Harms, U. (eds) High-Performance Computing and Networking. HPCN-Europe 1994. Lecture Notes in Computer Science, vol 797. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-57981-8_151
DOI: https://doi.org/10.1007/3-540-57981-8_151
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-57981-6
Online ISBN: 978-3-540-48408-0
eBook Packages: Springer Book Archive