Abstract
Today’s high-performance cluster computing demands extreme robustness against unexpected failures so that aggressively parallelized work can finish within a given time constraint. Although there has been steady effort to develop hardware and software tools that increase the fault resilience of cluster environments, a successful solution has yet to be delivered to commercial vendors. This paper presents SHIELD, a practical and easily deployable fault-tolerant MPI and MPI management system for an Infiniband cluster. SHIELD provides a novel framework that can be easily used in real cluster systems, and its design perspective differs from those of other fault-tolerant MPI implementations. We show that SHIELD provides robust fault resilience to fault-vulnerable cluster systems and that its design features are useful wherever fault resilience is of the utmost importance.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Han, H. et al. (2006). SHIELD: A Fault-Tolerant MPI for an Infiniband Cluster. In: Gerndt, M., Kranzlmüller, D. (eds) High Performance Computing and Communications. HPCC 2006. Lecture Notes in Computer Science, vol 4208. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11847366_90
DOI: https://doi.org/10.1007/11847366_90
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-39368-9
Online ISBN: 978-3-540-39372-6
eBook Packages: Computer Science, Computer Science (R0)