ABSTRACT
As the number of processors in today's high performance computers continues to grow, the mean time to failure of these machines is becoming significantly shorter than the execution time of many current high performance computing applications. Although today's architectures are usually robust enough to survive node failures without suffering complete system failure, most of today's high performance computing applications cannot survive node failures and therefore, whenever a node fails, have to abort and restart from the beginning or from a stable-storage-based checkpoint. This paper explores the use of a floating-point arithmetic coding approach to build fault-survivable high performance computing applications that can adapt to node failures without aborting. Although the use of erasure codes over Galois fields has been studied before in the context of diskless checkpointing, few actual implementations exist, probably because of concerns about both the efficiency and the complexity of implementing such codes in high performance computing applications. In this paper, we introduce a simple but efficient floating-point arithmetic coding approach to diskless checkpointing and address the associated round-off error issue. We also implement a floating-point arithmetic version of the Reed-Solomon coding scheme in a conjugate gradient equation solver and evaluate both the performance and the numerical impact of this scheme. Experimental results demonstrate that the proposed floating-point arithmetic coding approach is able to survive a small number of simultaneous node failures with low performance overhead and little numerical impact.
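To make the idea concrete, the following is a minimal sketch (not the paper's implementation) of floating-point weighted-checksum encoding of the kind a Reed-Solomon-style diskless checkpointing scheme performs. The processor count, data values, and Vandermonde-style weight choice are illustrative assumptions: four "processors" each hold a float vector, two checksum devices store weighted sums, and any two lost vectors are recovered by solving a small linear system in floating-point arithmetic (which is where the round-off error issue discussed above enters).

```python
# Illustrative sketch of floating-point weighted-checksum encoding
# and recovery. All names and sizes here are assumptions for the demo.

data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # x_0 .. x_3
n = len(data)
m = len(data[0])

# Vandermonde-style weights over the reals: with distinct positive
# nodes, the square subsystems below stay nonsingular, which is the
# property Reed-Solomon-style coding relies on.
W = [[float((i + 1) ** j) for i in range(n)] for j in range(2)]

# Encode: checksum device j holds C_j = sum_i W[j][i] * x_i (elementwise).
C = [[sum(W[j][i] * data[i][k] for i in range(n)) for k in range(m)]
     for j in range(2)]

# Simulate a simultaneous failure of processors 1 and 3.
lost = [1, 3]
survivors = [i for i in range(n) if i not in lost]

# Recover: subtract surviving contributions from each checksum, then
# solve the 2x2 system  W[j][lost[0]]*y0 + W[j][lost[1]]*y1 = rhs_j
# for every vector element, using Cramer's rule.
recovered = {i: [] for i in lost}
for k in range(m):
    rhs = [C[j][k] - sum(W[j][i] * data[i][k] for i in survivors)
           for j in range(2)]
    a, b = W[0][lost[0]], W[0][lost[1]]
    c, d = W[1][lost[0]], W[1][lost[1]]
    det = a * d - b * c
    recovered[lost[0]].append((rhs[0] * d - rhs[1] * b) / det)
    recovered[lost[1]].append((a * rhs[1] - c * rhs[0]) / det)

print(recovered)  # matches the lost vectors, up to round-off
```

In exact arithmetic the recovered vectors equal the lost ones; in floating-point arithmetic they agree only up to round-off, which is why the paper must quantify the numerical impact of recovery on the application's solution.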