Abstract
The Los Alamos Message Passing Interface (LA-MPI) is an end-to-end network-failure-tolerant message-passing system designed for terascale clusters. It is a standards-compliant implementation of MPI designed to tolerate network-related failures, including I/O bus errors, network card errors, and wire-transmission errors. This paper details the distinguishing features of LA-MPI, including support for concurrent use of multiple types of network interface and reliable message transmission over multiple network paths and routes between a given source and destination. Performance measurements on production-grade platforms are also presented.
Cite this article
Graham, R.L., Choi, SE., Daniel, D.J. et al. A Network-Failure-Tolerant Message-Passing System for Terascale Clusters. International Journal of Parallel Programming 31, 285–303 (2003). https://doi.org/10.1023/A:1024504726988
DOI: https://doi.org/10.1023/A:1024504726988