OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Partial differential equations preconditioner resilient to soft and hard faults

Journal Article · International Journal of High Performance Computing Applications
 [1];  [1];  [1];  [2];  [1];  [2];  [2];  [1]
  1. Sandia National Lab. (SNL-CA), Livermore, CA (United States)
  2. Duke Univ., Durham, NC (United States)

We present a domain-decomposition-based preconditioner for the solution of partial differential equations (PDEs) that is resilient to both soft and hard faults. The algorithm reformulates the PDE as a sampling problem, followed by a solution update through data manipulation that is itself resilient to both soft and hard faults. This reformulation allows us to recast the problem as a set of independent tasks and to exploit data locality to reduce global communication. We discuss two parallel implementations: (a) a single program multiple data (SPMD) version based on a one-to-one mapping between subdomains and MPI processes, each responsible for both state and computation; and (b) an asynchronous server–client implementation in which all state information is held by the servers and the clients are designed solely as computational units. We present a scalability comparison of the two implementations under nominal conditions, showing efficiency within ~80% for up to 12,000 cores. We then present a resilience analysis under different fault scenarios, based on the server–client implementation. This framework provides resiliency to hard faults: if a client crashes, it stops asking for work, and the servers simply distribute the work among the clients that remain alive. Erroneous subdomain solves (e.g., due to soft faults) appear as corrupted data, which is either rejected, if it causes a task to fail, or seamlessly filtered out during the regression stage through a suitable noise model. Three types of faults are modeled: hard faults, representing nodes (or clients) crashing; soft faults occurring during the communication of tasks between servers and clients; and soft faults occurring during task execution. We demonstrate the resiliency of the approach for a 2D elliptic PDE and explore the effect of the faults at various failure rates.
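The hard-fault behavior described in the abstract (a crashed client stops asking for work, and the remaining clients absorb it) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: the function `run_pull_based`, the `crash_after` fault model, and the requeueing of a task held by a crashing client are all hypothetical choices made for illustration, and the code runs serially with no MPI or real subdomain solves.

```python
from collections import deque


def run_pull_based(num_tasks, num_clients, crash_after):
    """Simulate clients pulling independent tasks from a server queue.

    crash_after (hypothetical fault model) maps a client id to the number
    of tasks it completes before crashing. A crashing client dies while
    holding one task, which the server puts back in the queue (assumption).
    """
    pending = deque(range(num_tasks))            # all state lives on the server
    results = {}                                 # task id -> client that completed it
    done_count = {c: 0 for c in range(num_clients)}
    alive = set(range(num_clients))

    while pending and alive:
        # sorted() copies the set, so removing clients inside the loop is safe
        for client in sorted(alive):
            if not pending:
                break
            task = pending.popleft()             # the client asks for work
            if done_count[client] >= crash_after.get(client, float("inf")):
                alive.discard(client)            # hard fault: it never asks again
                pending.append(task)             # server requeues the lost task
                continue
            results[task] = client               # stand-in for a subdomain solve
            done_count[client] += 1
    return results, alive


# Client 1 crashes after completing two tasks; clients 0 and 2 finish the rest.
results, alive = run_pull_based(num_tasks=10, num_clients=3, crash_after={1: 2})
```

Because work is pulled by the clients rather than pushed by the server, no explicit failure detection appears in this sketch: a dead client simply drops out of the set of requesters. If every client crashes, the loop ends with tasks still pending, which a real system would have to handle separately.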

Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States). National Energy Research Scientific Computing Center (NERSC); Univ. of California, Oakland, CA (United States); Lockheed Martin Corporation, Littleton, CO (United States)
Sponsoring Organization:
USDOE Office of Science (SC)
Grant/Contract Number:
AC02-05CH11231; AC04-94AL85000
OSTI ID:
1544016
Journal Information:
International Journal of High Performance Computing Applications, Vol. 32, Issue 5; ISSN 1094-3420
Publisher:
SAGE
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 2 works
Citation information provided by Web of Science
