SHIELD: A Fault-Tolerant MPI for an Infiniband Cluster

  • Conference paper
High Performance Computing and Communications (HPCC 2006)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4208)

Abstract

Today’s high-performance cluster computing demands extreme robustness against unexpected failures so that aggressively parallelized work can finish within a given time constraint. Although there has been a steady effort to develop hardware and software tools that increase the fault resilience of cluster environments, a successful solution has yet to be delivered to commercial vendors. This paper presents SHIELD, a practical and easily deployable fault-tolerant MPI and MPI management system for an Infiniband cluster. SHIELD provides a novel framework that can be easily used in real cluster systems, and its design perspective differs from those of other fault-tolerant MPI implementations. We show that SHIELD provides robust fault resilience to fault-vulnerable cluster systems and that its design features are useful wherever fault resilience is of the utmost importance.

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Han, H. et al. (2006). SHIELD: A Fault-Tolerant MPI for an Infiniband Cluster. In: Gerndt, M., Kranzlmüller, D. (eds) High Performance Computing and Communications. HPCC 2006. Lecture Notes in Computer Science, vol 4208. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11847366_90

  • DOI: https://doi.org/10.1007/11847366_90

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-39368-9

  • Online ISBN: 978-3-540-39372-6

  • eBook Packages: Computer Science (R0)
