DOI: 10.1145/2089002.2089006
research-article

A hybrid fault tolerance scheme for EasyGrid MPI applications

Published: 12 December 2011

ABSTRACT

Writing applications capable of executing efficiently in distributed systems is extremely difficult and tedious for inexperienced users. The resources may be heterogeneous, non-dedicated, and offered without any performance or availability guarantees. Systems capable of adapting the execution of an application to these characteristics are essential. The EasyGrid Application Management System (AMS) transforms cluster-based MPI applications into autonomic ones capable of executing robustly and efficiently in distributed environments. This work describes a strategy to endow these autonomic MPI applications with the property of self-healing, making them capable of withstanding multiple simultaneous crash faults of processes and/or processors. The extremely low intrusion cost of the proposed hybrid solution might now facilitate acceptance of fault tolerance techniques in large-scale high-performance applications.

Published in

MGC '11: Proceedings of the 9th International Workshop on Middleware for Grids, Clouds and e-Science
December 2011
38 pages
ISBN: 9781450310680
DOI: 10.1145/2089002

Copyright © 2011 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

MGC '11 paper acceptance rate: 5 of 13 submissions, 38%
Overall acceptance rate: 14 of 36 submissions, 39%
