ABSTRACT
Writing applications that execute efficiently in distributed systems is extremely difficult and tedious for inexperienced users. The resources may be heterogeneous, non-dedicated, and offered without any performance or availability guarantees, so systems capable of adapting the execution of an application to these characteristics are essential. The EasyGrid Application Management System (AMS) transforms cluster-based MPI applications into autonomic ones capable of executing robustly and efficiently in distributed environments. This work describes a strategy that endows these autonomic MPI applications with the property of self-healing, making them capable of withstanding multiple simultaneous crash faults of processes and/or processors. The extremely low intrusion cost of the proposed hybrid solution may ease the acceptance of fault tolerance techniques in large-scale high-performance applications.
Index Terms
- A hybrid fault tolerance scheme for EasyGrid MPI applications