ABSTRACT
As the number of processors in today's high performance computers continues to grow, the mean time to failure of these machines is becoming significantly shorter than the execution time of many current high performance computing applications. Although today's architectures are usually robust enough to survive node failures without suffering complete system failure, most of today's high performance computing applications cannot survive node failures and therefore, whenever a node fails, have to abort and restart from the beginning or from a stable-storage-based checkpoint. This paper explores the use of a floating-point arithmetic coding approach to build fault-survivable high performance computing applications that can adapt to node failures without aborting. Although the use of erasure codes over Galois fields has been studied before in the context of diskless checkpointing, few actual implementations exist, probably because of concerns about both the efficiency and the complexity of implementing such codes in high performance computing applications. In this paper, we introduce a simple but efficient floating-point arithmetic coding approach to diskless checkpointing and address the associated round-off error issue. We also implement a floating-point arithmetic version of the Reed-Solomon coding scheme in a conjugate gradient equation solver and evaluate both the performance and the numerical impact of this scheme. Experimental results demonstrate that the proposed floating-point arithmetic coding approach is able to survive a small number of simultaneous node failures with low performance overhead and little numerical impact.
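To make the idea concrete, the following is a minimal sketch (not the paper's implementation) of floating-point weighted-checksum encoding of the kind a Reed-Solomon-style diskless checkpointing scheme performs. The processor count, data values, and Vandermonde-style weight choice are illustrative assumptions: four "processors" each hold a float vector, two checksum devices store weighted sums, and any two lost vectors are recovered by solving a small linear system in floating-point arithmetic (which is where the round-off error issue discussed above enters).

```python
# Illustrative sketch of floating-point weighted-checksum encoding
# and recovery. All names and sizes here are assumptions for the demo.

data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # x_0 .. x_3
n = len(data)
m = len(data[0])

# Vandermonde-style weights over the reals: with distinct positive
# nodes, the square subsystems below stay nonsingular, which is the
# property Reed-Solomon-style coding relies on.
W = [[float((i + 1) ** j) for i in range(n)] for j in range(2)]

# Encode: checksum device j holds C_j = sum_i W[j][i] * x_i (elementwise).
C = [[sum(W[j][i] * data[i][k] for i in range(n)) for k in range(m)]
     for j in range(2)]

# Simulate a simultaneous failure of processors 1 and 3.
lost = [1, 3]
survivors = [i for i in range(n) if i not in lost]

# Recover: subtract surviving contributions from each checksum, then
# solve the 2x2 system  W[j][lost[0]]*y0 + W[j][lost[1]]*y1 = rhs_j
# for every vector element, using Cramer's rule.
recovered = {i: [] for i in lost}
for k in range(m):
    rhs = [C[j][k] - sum(W[j][i] * data[i][k] for i in survivors)
           for j in range(2)]
    a, b = W[0][lost[0]], W[0][lost[1]]
    c, d = W[1][lost[0]], W[1][lost[1]]
    det = a * d - b * c
    recovered[lost[0]].append((rhs[0] * d - rhs[1] * b) / det)
    recovered[lost[1]].append((a * rhs[1] - c * rhs[0]) / det)

print(recovered)  # matches the lost vectors, up to round-off
```

In exact arithmetic the recovered vectors equal the lost ones; in floating-point arithmetic they agree only up to round-off, which is why the paper must quantify the numerical impact of recovery on the application's solution.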