Run-Through Stabilization: An MPI Proposal for Process Fault Tolerance

  • Conference paper
Recent Advances in the Message Passing Interface (EuroMPI 2011)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 6960)

Abstract

The MPI standard lacks semantics and interfaces for sustained application execution in the presence of process failures. Exascale HPC systems may require scalable, fault-resilient MPI applications. The mission of the MPI Forum’s Fault Tolerance Working Group is to enhance the standard to enable the development of scalable, fault-tolerant HPC applications. This paper presents an overview of the Run-Through Stabilization proposal, which allows an application to continue execution even if MPI processes fail during execution. The discussion introduces the implications for point-to-point and collective operations over communicators, though the full proposal addresses all aspects of the MPI standard.

Special thanks to the MPI Forum and Fault Tolerance Working Group members who contributed to the run-through stabilization proposal. Their comments and insights continue to help strengthen the developing proposals targeted for inclusion in the Message Passing Interface (MPI) standard. Research sponsored by the Office of Advanced Scientific Computing Research; Office of Science; Mathematical, Information, and Computational Sciences Division at Oak Ridge National Laboratory; U.S. Department of Energy, under Contract No. DE-AC05-00OR22725 with UT-Battelle, LLC; U.S. Department of Energy, under Contract No. DE-AC02-06CH11357; U.S. Department of Energy, under Contract No. DE-AC52-07NA27344 by Lawrence Livermore National Laboratory; The ARRA / DoE - Early Career Research Program; and by award #CCF-0816909 from the National Science Foundation.

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hursey, J., Graham, R.L., Bronevetsky, G., Buntinas, D., Pritchard, H., Solt, D.G. (2011). Run-Through Stabilization: An MPI Proposal for Process Fault Tolerance. In: Cotronis, Y., Danalis, A., Nikolopoulos, D.S., Dongarra, J. (eds) Recent Advances in the Message Passing Interface. EuroMPI 2011. Lecture Notes in Computer Science, vol 6960. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24449-0_40

  • DOI: https://doi.org/10.1007/978-3-642-24449-0_40

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-24448-3

  • Online ISBN: 978-3-642-24449-0

  • eBook Packages: Computer Science, Computer Science (R0)
