Abstract
Extreme-scale parallel systems will require alternative methods for applications to maintain current levels of uninterrupted execution. Redundant computation is one approach to consider, if the benefits of increased resiliency outweigh the cost of consuming additional resources. We describe a transparent redundancy approach for MPI applications and detail two different implementations that provide the ability to tolerate a range of failure scenarios, including loss of application processes and connectivity. We compare these two approaches and show performance results from micro-benchmarks that bound worst-case message passing performance degradation. We propose several enhancements that could lower the overhead of providing resiliency through redundancy.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Ferreira, K., Riesen, R., Oldfield, R., Stearley, J., Laros, J., Pedretti, K., Brightwell, R., Kordenbrock, T.: Increasing fault resiliency in a message-passing environment. Technical report SAND2009-6753, Sandia National Laboratories (2009)
Riesen, R., Ferreira, K., Stearley, J.: See applications run and throughput jump: The case for redundant computing in HPC. In: 1st International Workshop on Fault-Tolerance for HPC at Extreme Scale, FTXS 2010 (2010)
Network-Based Computing Laboratory, Ohio State University: OSU MPI benchmarks, OMB (2010), http://mvapich.cse.ohio-state.edu/benchmarks/
Schroeder, B., Gibson, G.A.: Understanding failures in petascale computers. Journal of Physics: Conference Series 78(1), 188–198 (2007)
Zheng, Z., Lan, Z.: Reliability-aware scalability models for high performance computing, In: Proceedings of the IEEE conference on Cluster Computing (2009)
He, X., Ou, L., Engelmann, C., Chen, X., Scott, S.L.: Symmetric active/active metadata service for high availability parallel file systems. Journal of Parallel and Distributed Computing (JPDC) 69(12), 961–973 (2009)
Fagg, G.E., Dongarra, J.: FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic world. In: Proceedings of the 7th European PVM/MPI Users’ Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, pp. 346–353 (2000)
Gropp, W., Lusk, E.: Fault tolerance in message passing interface programs. International Journal of High Performance Computing Applications 18(3) (2004)
Bouteiller, A., Cappello, F., Herault, T., Krawezik, G., Lemarinier, P., Magniette, F.: MPICH-V2: a fault tolerant MPI for volatile nodes based on pessimistic sender based message logging. In: Proceedings of the ACM/IEEE International Conference on High Performance Computing and Networking (2003)
Hursey, J., Squyres, J., Mattox, T., Lumsdaine, A.: The design and implementation of checkpoint/restart process fault tolerance for Open MPI. In: Proceedings of the IEEE International Parallel and Distributed Processing Symposium (2007)
Santos, G., Duarte, A., Rexachs, D., Luque, E.: Providing non-stop service for message-passing based parallel applications with RADIC. In: Luque, E., Margalef, T., BenÃtez, D. (eds.) Euro-Par 2008. LNCS, vol. 5168, pp. 58–67. Springer, Heidelberg (2008)
Genaud, S., Rattanapoka, C.: P2P-MPI: A peer-to-peer framework for robust execution of message passing parallel programs on grids. J. Grid Comput. 5(1), 27–42 (2007)
Genaud, S., Jeannot, E., Rattanapoka, C.: Fault-management in P2P-MPI. Int. J. Parallel Program. 37(5), 433–461 (2009)
Farreras, M., Cortes, T., Labarta, J., Almasi, G.: Scaling MPI to short-memory MPPs such as BG/L. In: Proceeding of the International Conference on Supercomputing, pp. 209–218 (2006)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Brightwell, R., Ferreira, K., Riesen, R. (2010). Transparent Redundant Computing with MPI. In: Keller, R., Gabriel, E., Resch, M., Dongarra, J. (eds) Recent Advances in the Message Passing Interface. EuroMPI 2010. Lecture Notes in Computer Science, vol 6305. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15646-5_22
Download citation
DOI: https://doi.org/10.1007/978-3-642-15646-5_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15645-8
Online ISBN: 978-3-642-15646-5
eBook Packages: Computer ScienceComputer Science (R0)