research-article

A hybrid fault tolerance scheme for EasyGrid MPI applications

Authors:
Jacques A. da Silva, Universidade Federal Fluminense (UFF), Niterói, RJ, Brazil
Vinod E. F. Rebello, Universidade Federal Fluminense (UFF), Niterói, RJ, Brazil

MGC '11: Proceedings of the 9th International Workshop on Middleware for Grids, Clouds and e-Science
December 2011, Article No.: 4, Pages 1–6
https://doi.org/10.1145/2089002.2089006

Published: 12 December 2011

ABSTRACT

Writing applications capable of executing efficiently in distributed systems is extremely difficult and tedious for inexperienced users. The resources may be heterogeneous, non-dedicated, and offered without any performance or availability guarantees. Systems capable of adapting the execution of an application to these characteristics are essential. The EasyGrid Application Management System (AMS) transforms cluster-based MPI applications into autonomic ones capable of executing robustly and efficiently in distributed environments. This work describes a strategy to endow these autonomic MPI applications with the property of self-healing, making them capable of withstanding multiple simultaneous crash faults of processes and/or processors. The extremely low intrusion cost of the proposed hybrid solution might now facilitate the acceptance of fault tolerance techniques in large-scale high-performance applications.

Index Terms

1. Computer systems organization
  1. Architectures
    1. Distributed architectures
  2. Dependable and fault-tolerant systems and networks
2. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Communications management
        Message passing
    2. Software system structures
      1. Distributed systems organizing principles

Published in
MGC '11: Proceedings of the 9th International Workshop on Middleware for Grids, Clouds and e-Science
December 2011
38 pages
ISBN: 9781450310680
DOI: 10.1145/2089002
Editors:
Bruno Schulze, National Lab for Scientific Computing (LNCC), Brazil
Omer Rana, Cardiff University, UK
Edmundo Madeira, State University of Campinas (UNICAMP), Brazil
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
autonomic MPI applications
computational grids
fault tolerance
Acceptance Rates
MGC '11 Paper Acceptance Rate: 5 of 13 submissions, 38%
Overall Acceptance Rate: 14 of 36 submissions, 39%