A Network-Failure-Tolerant Message-Passing System for Terascale Clusters

International Journal of Parallel Programming

Abstract

The Los Alamos Message Passing Interface (LA-MPI) is an end-to-end network-failure-tolerant message-passing system designed for terascale clusters. LA-MPI is a standard-compliant implementation of MPI designed to tolerate network-related failures including I/O bus errors, network card errors, and wire-transmission errors. This paper details the distinguishing features of LA-MPI, including support for concurrent use of multiple types of network interface, and reliable message transmission utilizing multiple network paths and routes between a given source and destination. In addition, performance measurements on production-grade platforms are presented.
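The end-to-end approach behind LA-MPI's design can be illustrated with a short sketch. The C program below is a minimal illustration, not LA-MPI source: the additive checksum, the lossy_send fault model, and the retry-until-intact policy are assumptions chosen for brevity. It shows the essential loop of sender-side retransmission driven by an integrity check applied at the application boundary, which catches corruption (for example on the I/O bus) that link-level CRCs cannot see.

    /* Illustrative end-to-end checksum/retransmit loop (not LA-MPI code). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdint.h>

    /* Simple additive checksum over the payload. */
    static uint32_t checksum(const uint8_t *buf, size_t len) {
        uint32_t sum = 0;
        for (size_t i = 0; i < len; i++)
            sum += buf[i];
        return sum;
    }

    /* A lossy "wire": occasionally flips a bit in transit, modeling
       the bus, NIC, and wire faults that can corrupt data undetected. */
    static void lossy_send(const uint8_t *src, uint8_t *dst, size_t len) {
        memcpy(dst, src, len);
        if (rand() % 3 == 0)                 /* inject a fault 1 time in 3 */
            dst[rand() % len] ^= 0x01;
    }

    int main(void) {
        const uint8_t msg[] = "terascale payload";
        uint8_t rx[sizeof msg];
        uint32_t sent_csum = checksum(msg, sizeof msg);

        /* Sender retransmits until the receiver's end-to-end check passes. */
        for (int attempt = 1; ; attempt++) {
            lossy_send(msg, rx, sizeof msg);
            if (checksum(rx, sizeof rx) == sent_csum) { /* receiver ACKs */
                printf("delivered intact on attempt %d\n", attempt);
                break;
            }
            printf("attempt %d corrupted, retransmitting\n", attempt);
        }
        return 0;
    }

In a real protocol the checksum travels with the message and acknowledgments flow back over the network; the shared sent_csum above simply stands in for that exchange.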



Cite this article

Graham, R.L., Choi, S.E., Daniel, D.J. et al. A Network-Failure-Tolerant Message-Passing System for Terascale Clusters. International Journal of Parallel Programming 31, 285–303 (2003). https://doi.org/10.1023/A:1024504726988

