A Network-Failure-Tolerant Message-Passing System for Terascale Clusters

International Journal of Parallel Programming

Abstract

The Los Alamos Message Passing Interface (LA-MPI) is an end-to-end network-failure-tolerant message-passing system designed for terascale clusters. LA-MPI is a standard-compliant implementation of MPI designed to tolerate network-related failures including I/O bus errors, network card errors, and wire-transmission errors. This paper details the distinguishing features of LA-MPI, including support for concurrent use of multiple types of network interface, and reliable message transmission utilizing multiple network paths and routes between a given source and destination. In addition, performance measurements on production-grade platforms are presented.
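The end-to-end approach behind LA-MPI's design can be illustrated with a short sketch. The C program below is a minimal illustration, not LA-MPI source: the additive checksum, the lossy_send fault model, and the retry-until-intact policy are assumptions chosen for brevity. It shows the essential loop of sender-side retransmission driven by an integrity check applied at the application boundary, which catches corruption (for example on the I/O bus) that link-level CRCs cannot see.

    /* Illustrative end-to-end checksum/retransmit loop (not LA-MPI code). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdint.h>

    /* Simple additive checksum over the payload. */
    static uint32_t checksum(const uint8_t *buf, size_t len) {
        uint32_t sum = 0;
        for (size_t i = 0; i < len; i++)
            sum += buf[i];
        return sum;
    }

    /* A lossy "wire": occasionally flips a bit in transit, modeling
       the bus, NIC, and wire faults that can corrupt data undetected. */
    static void lossy_send(const uint8_t *src, uint8_t *dst, size_t len) {
        memcpy(dst, src, len);
        if (rand() % 3 == 0)                 /* inject a fault 1 time in 3 */
            dst[rand() % len] ^= 0x01;
    }

    int main(void) {
        const uint8_t msg[] = "terascale payload";
        uint8_t rx[sizeof msg];
        uint32_t sent_csum = checksum(msg, sizeof msg);

        /* Sender retransmits until the receiver's end-to-end check passes. */
        for (int attempt = 1; ; attempt++) {
            lossy_send(msg, rx, sizeof msg);
            if (checksum(rx, sizeof rx) == sent_csum) { /* receiver ACKs */
                printf("delivered intact on attempt %d\n", attempt);
                break;
            }
            printf("attempt %d corrupted, retransmitting\n", attempt);
        }
        return 0;
    }

In a real protocol the checksum travels with the message and acknowledgments flow back over the network; the shared sent_csum above simply stands in for that exchange.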



Cite this article

Graham, R.L., Choi, S.E., Daniel, D.J. et al. A Network-Failure-Tolerant Message-Passing System for Terascale Clusters. International Journal of Parallel Programming 31, 285–303 (2003). https://doi.org/10.1023/A:1024504726988

