DOI: 10.1145/1065944.1065973
Article

Fault tolerant high performance computing by a coding approach

Published: 15 June 2005

ABSTRACT

As the number of processors in today's high performance computers continues to grow, the mean-time-to-failure of these computers is becoming significantly shorter than the execution time of many current high performance computing applications. Although today's architectures are usually robust enough to survive node failures without suffering complete system failure, most of today's high performance computing applications cannot survive node failures and therefore, whenever a node fails, have to abort and restart from the beginning or from a stable-storage-based checkpoint.

This paper explores the use of a floating-point arithmetic coding approach to build fault-survivable high performance computing applications that can adapt to node failures without aborting. Although the use of erasure codes over Galois fields has been proposed before for diskless checkpointing, few actual implementations exist, probably because of concerns about both the efficiency and the complexity of implementing such codes in high performance computing applications. In this paper, we introduce a simple but efficient floating-point arithmetic coding approach into diskless checkpointing and address the associated round-off error issue. We also implement a floating-point arithmetic version of the Reed-Solomon coding scheme in a conjugate gradient equation solver and evaluate both the performance and the numerical impact of this scheme. Experimental results demonstrate that the proposed floating-point arithmetic coding approach is able to survive a small number of simultaneous node failures with low performance overhead and little numerical impact.
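The idea of floating-point arithmetic coding for diskless checkpointing can be sketched as follows. This is an illustrative example, not the paper's implementation: the Vandermonde weight choice and all function names here are assumptions. Each processor holds a local vector of floats; m checksum vectors are computed as weighted sums, and after up to m simultaneous failures the lost vectors are recovered by solving a small m-by-m linear system in floating-point arithmetic (which is where the round-off error issue discussed in the paper arises).

```python
def checksums(data, m):
    """Encode len(data) processor vectors into m checksum vectors.

    Checksum j is sum_i w[j][i] * data[i] with Vandermonde-style
    weights w[j][i] = (i + 1) ** j, so the submatrix picked out by
    any m failed processors is (in exact arithmetic) invertible.
    """
    p, n = len(data), len(data[0])
    return [[sum(((i + 1) ** j) * data[i][k] for i in range(p))
             for k in range(n)]
            for j in range(m)]

def recover(data, cks, lost):
    """Rebuild the vectors of the failed processors listed in `lost`.

    Subtracts the surviving contributions from each checksum, then
    solves the m-by-m system by Gauss-Jordan elimination with
    partial pivoting (round-off grows with the system's condition).
    """
    p, n, m = len(data), len(data[0]), len(lost)
    # Right-hand side: checksum minus contributions of survivors.
    rhs = [[cks[j][k] - sum(((i + 1) ** j) * data[i][k]
                            for i in range(p) if i not in lost)
            for k in range(n)] for j in range(m)]
    # Coefficient matrix restricted to the failed processors.
    A = [[(i + 1) ** j for i in lost] for j in range(m)]
    for col in range(m):
        # Partial pivoting: swap the largest remaining pivot up.
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        rhs[col], rhs[piv] = rhs[piv], rhs[col]
        # Eliminate this column from every other row.
        for r in range(m):
            if r != col:
                f = A[r][col] / A[col][col]
                for c in range(m):
                    A[r][c] -= f * A[col][c]
                for k in range(n):
                    rhs[r][k] -= f * rhs[col][k]
    # Diagonal system: row j gives the vector of processor lost[j].
    for j, i in enumerate(lost):
        data[i] = [rhs[j][k] / A[j][j] for k in range(n)]
    return data
```

With m = 1 this degenerates to simple parity; larger m tolerates more simultaneous failures at the cost of a worse-conditioned recovery system, which is the efficiency/accuracy trade-off the paper evaluates.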


Published in

PPoPP '05: Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
June 2005, 310 pages
ISBN: 1595930809
DOI: 10.1145/1065944

Copyright © 2005 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall acceptance rate: 230 of 1,014 submissions, 23%
