Approaches for Parallel Applications Fault Tolerance

Graham, Richard L.

doi:10.1007/11846802_2

Richard L. Graham

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 4192)

Included in the following conference series:

European Parallel Virtual Machine / Message Passing Interface Users’ Group Meeting

Abstract

System component failures (hardware and software, permanent and transient) are an integral part of the life cycle of any computer system. The degree to which a system suffers from these failures depends on factors such as system complexity, system design and implementation, and system size. These failures may lead to catastrophic application failure (such as termination of an application run after a CPU failure), silent application errors (such as network data corruption), or application hangs (such as when a network interface card (NIC) malfunctions), all of which waste valuable computer time. For certain classes of computer systems, dealing with these failures is a requirement for providing a simulation environment reliable enough to meet end-user needs. Moreover, the more automated these solutions are, requiring minimal or no end-user intervention, the more likely they are to be used to achieve the required application stability. Dealing with failure, or fault tolerance, while minimizing application performance degradation is an active research area, with no consensus on what the optimal solution strategies are, or even on which failures need to be considered. Failures include transient data transmission errors (dropped or corrupt packets), transient and permanent network failures (NIC), and process failure, to list a few. The current MPI standard addresses a limited number of failure scenarios, with application termination being the default response to failure. While the standard provides a mechanism for users to override this default response, it does not define error codes that provide information on system-level failures, whether hardware or software. Nonetheless, these failures need to be addressed to provide end users with systems that meet their computing needs.
Building on experience gained in the LA-MPI, FT-MPI, and LAM/MPI projects, the Open MPI collaboration has implemented, and is continuing to implement, optional solutions that deal with a number of failure scenarios, in order to reduce the application failure rate to acceptable levels. The types of errors currently being dealt with include transient network data transmission errors, transient and permanent NIC failures, and process failure. The talk will discuss fault detection, fault recovery methods, and the degree to which applications need to be modified, if at all, to benefit from these. In addition, the performance impact of these solutions on several applications will be discussed.

Author information

Authors and Affiliations

Advanced Computing Laboratory, Los Alamos National Laboratory, Los Alamos, NM, 87544, USA
Richard L. Graham

Editor information

Editors and Affiliations

Forschungszentrum Jülich, ZAM, 52425, Jülich, Germany
Bernd Mohr
NEC Europe Ltd., NEC Laboratories Europe, Rathausallee 10, D-53757, Sankt Augustin, Germany
Jesper Larsson Träff
Dolphin Interconnect Solutions ASA R&D Germany, Siebengebirgsblick 26, 53343, Wachtberg, Germany
Joachim Worringen
Computer Science Department, University of Tennessee, 37996-3450, Knoxville, TN, USA
Jack Dongarra

About this paper

Cite this paper

Graham, R.L. (2006). Approaches for Parallel Applications Fault Tolerance. In: Mohr, B., Träff, J.L., Worringen, J., Dongarra, J. (eds) Recent Advances in Parallel Virtual Machine and Message Passing Interface. EuroPVM/MPI 2006. Lecture Notes in Computer Science, vol 4192. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11846802_2

DOI: https://doi.org/10.1007/11846802_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-39110-4
Online ISBN: 978-3-540-39112-8
eBook Packages: Computer Science, Computer Science (R0)
