
Transparent Redundant Computing with MPI

  • Conference paper
Recent Advances in the Message Passing Interface (EuroMPI 2010)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 6305)

Included in the following conference series: EuroMPI (European MPI Users' Group Meeting)

Abstract

Extreme-scale parallel systems will require alternative methods for applications to maintain current levels of uninterrupted execution. Redundant computation is one approach to consider, if the benefits of increased resiliency outweigh the cost of consuming additional resources. We describe a transparent redundancy approach for MPI applications and detail two different implementations that provide the ability to tolerate a range of failure scenarios, including loss of application processes and connectivity. We compare these two approaches and show performance results from micro-benchmarks that bound worst-case message passing performance degradation. We propose several enhancements that could lower the overhead of providing resiliency through redundancy.
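
The redundancy layer described in the abstract is transparent to the application, and a common way to build such a layer is to interpose on MPI calls through the MPI profiling interface (PMPI). The C sketch below illustrates the idea only and is not the authors' implementation: it mirrors every MPI_Send to a hypothetical replica of the destination rank, assuming that the upper half of MPI_COMM_WORLD replicates the lower half (the replica_of mapping is an invented placeholder).

    /* Minimal PMPI interposition sketch: mirror each send to a replica rank.
     * Assumption (not from the paper): ranks [size/2, size) replicate
     * ranks [0, size/2) of the communicator. */
    #include <mpi.h>

    /* Hypothetical mapping from an application rank to its replica rank. */
    static int replica_of(int rank, MPI_Comm comm)
    {
        int size;
        PMPI_Comm_size(comm, &size);
        return rank + size / 2;
    }

    /* Interpose on MPI_Send: the application calls MPI_Send as usual, and
     * the wrapper forwards the message to the primary destination and,
     * transparently, to its replica as well. */
    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
        if (rc != MPI_SUCCESS)
            return rc;
        return PMPI_Send(buf, count, datatype, replica_of(dest, comm), tag, comm);
    }

A complete interposition library would also have to cover receives, wildcard sources and tags, collectives, and failure detection; the sketch is only meant to show how redundancy can be layered underneath an application without changing its code, and why duplicated message traffic sets the worst-case message-passing overhead that the paper's micro-benchmarks bound.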

References

  1. Ferreira, K., Riesen, R., Oldfield, R., Stearley, J., Laros, J., Pedretti, K., Brightwell, R., Kordenbrock, T.: Increasing fault resiliency in a message-passing environment. Technical report SAND2009-6753, Sandia National Laboratories (2009)

    Google Scholar 

  2. Riesen, R., Ferreira, K., Stearley, J.: See applications run and throughput jump: The case for redundant computing in HPC. In: 1st International Workshop on Fault-Tolerance for HPC at Extreme Scale, FTXS 2010 (2010)

    Google Scholar 

  3. Network-Based Computing Laboratory, Ohio State University: OSU MPI benchmarks, OMB (2010), http://mvapich.cse.ohio-state.edu/benchmarks/

  4. Schroeder, B., Gibson, G.A.: Understanding failures in petascale computers. Journal of Physics: Conference Series 78(1), 188–198 (2007)

    Google Scholar 

  5. Zheng, Z., Lan, Z.: Reliability-aware scalability models for high performance computing, In: Proceedings of the IEEE conference on Cluster Computing (2009)

    Google Scholar 

  6. He, X., Ou, L., Engelmann, C., Chen, X., Scott, S.L.: Symmetric active/active metadata service for high availability parallel file systems. Journal of Parallel and Distributed Computing (JPDC) 69(12), 961–973 (2009)

    Article  Google Scholar 

  7. Fagg, G.E., Dongarra, J.: FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic world. In: Proceedings of the 7th European PVM/MPI Users’ Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, pp. 346–353 (2000)

    Google Scholar 

  8. Gropp, W., Lusk, E.: Fault tolerance in message passing interface programs. International Journal of High Performance Computing Applications 18(3) (2004)

    Google Scholar 

  9. Bouteiller, A., Cappello, F., Herault, T., Krawezik, G., Lemarinier, P., Magniette, F.: MPICH-V2: a fault tolerant MPI for volatile nodes based on pessimistic sender based message logging. In: Proceedings of the ACM/IEEE International Conference on High Performance Computing and Networking (2003)

    Google Scholar 

  10. Hursey, J., Squyres, J., Mattox, T., Lumsdaine, A.: The design and implementation of checkpoint/restart process fault tolerance for Open MPI. In: Proceedings of the IEEE International Parallel and Distributed Processing Symposium (2007)

    Google Scholar 

  11. Santos, G., Duarte, A., Rexachs, D., Luque, E.: Providing non-stop service for message-passing based parallel applications with RADIC. In: Luque, E., Margalef, T., Benítez, D. (eds.) Euro-Par 2008. LNCS, vol. 5168, pp. 58–67. Springer, Heidelberg (2008)

    Chapter  Google Scholar 

  12. Genaud, S., Rattanapoka, C.: P2P-MPI: A peer-to-peer framework for robust execution of message passing parallel programs on grids. J. Grid Comput. 5(1), 27–42 (2007)

    Article  Google Scholar 

  13. Genaud, S., Jeannot, E., Rattanapoka, C.: Fault-management in P2P-MPI. Int. J. Parallel Program. 37(5), 433–461 (2009)

    Article  MATH  Google Scholar 

  14. Farreras, M., Cortes, T., Labarta, J., Almasi, G.: Scaling MPI to short-memory MPPs such as BG/L. In: Proceeding of the International Conference on Supercomputing, pp. 209–218 (2006)

    Google Scholar 

Download references

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Brightwell, R., Ferreira, K., Riesen, R. (2010). Transparent Redundant Computing with MPI. In: Keller, R., Gabriel, E., Resch, M., Dongarra, J. (eds) Recent Advances in the Message Passing Interface. EuroMPI 2010. Lecture Notes in Computer Science, vol 6305. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15646-5_22

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-15646-5_22

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-15645-8

  • Online ISBN: 978-3-642-15646-5

  • eBook Packages: Computer Science, Computer Science (R0)
