OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Node failure resiliency for Uintah without checkpointing

Journal Article · Concurrency and Computation. Practice and Experience
DOI: https://doi.org/10.1002/cpe.5340 · OSTI ID: 1637354

The frequency of failures in upcoming exascale supercomputers may well be greater than at present due to many-core architectures if component failure rates remain unchanged. This potential increase in failure frequency, coupled with I/O challenges at exascale, may prove problematic for current resiliency approaches such as checkpoint/restart, although the use of fast intermediate memory may help. Algorithm-Based Fault Tolerance (ABFT) using Adaptive Mesh Refinement (AMR) is one resiliency approach used to address these challenges. For adaptive mesh codes, a coarse mesh version of the solution may be used to restore the fine mesh solution. This paper addresses the implementation of the ABFT approach within the Uintah software framework, both at a software level within Uintah and in the data reconstruction method used for the recovery of lost data. This method has two problems: inaccuracies introduced during the reconstruction propagate forward in time, and the physical consistency of variables, such as positivity or boundedness, may be violated during interpolation. These challenges can be addressed by combining two techniques: (1) a fault-tolerant MPI implementation to recover from runtime node failures, and (2) high-order interpolation schemes to preserve the physical solution and reconstruct lost data. The approach considered here uses a "Limited Essentially Non-Oscillatory" (LENO) scheme along with AMR to rebuild the lost data without checkpointing using Uintah. Experiments were carried out using a fault-tolerant MPI implementation, ULFM, to recover from runtime failure, and LENO to recover data on patches belonging to failed ranks, while the simulation was continued to the end. Results show that this ABFT approach is up to 10x faster than the traditional checkpointing method. The new interpolation approach is more accurate than linear interpolation and is not subject to the overshoots found in other interpolation methods.
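The boundedness issue described in the abstract can be illustrated with a minimal one-dimensional sketch. This is not the paper's LENO scheme and does not use Uintah: it is a hypothetical stand-in that rebuilds a node-centered fine grid from coarse values with a cubic Lagrange interpolant (linear fallback at the two end intervals), then optionally clamps each rebuilt value to the range of its two neighboring coarse values. The function names `cubic_lagrange` and `prolong`, the refinement ratio, and the clamp-style limiter are all assumptions chosen for illustration.

```python
def cubic_lagrange(y0, y1, y2, y3, t):
    """Cubic Lagrange interpolant through nodes x = -1, 0, 1, 2, evaluated at x = t."""
    return (
        -t * (t - 1) * (t - 2) / 6 * y0
        + (t + 1) * (t - 1) * (t - 2) / 2 * y1
        - (t + 1) * t * (t - 2) / 2 * y2
        + (t + 1) * t * (t - 1) / 6 * y3
    )


def prolong(coarse, ratio=2, limit=True):
    """Rebuild fine-grid node values from coarse-grid node values.

    Hypothetical illustration only: interior intervals use a 4-point cubic,
    end intervals fall back to linear interpolation. With limit=True each
    value is clamped to the range of the two coarse values that bracket it,
    so no new extrema are created.
    """
    n = len(coarse)
    fine = []
    for i in range(n - 1):
        lo, hi = sorted((coarse[i], coarse[i + 1]))
        for k in range(ratio):
            t = k / ratio
            if 1 <= i <= n - 3:
                v = cubic_lagrange(coarse[i - 1], coarse[i],
                                   coarse[i + 1], coarse[i + 2], t)
            else:
                v = (1 - t) * coarse[i] + t * coarse[i + 1]
            if limit:
                v = min(max(v, lo), hi)  # simple clamp limiter (assumption)
            fine.append(v)
    fine.append(coarse[-1])  # last coarse node coincides with last fine node
    return fine


# A step profile bounded in [0, 1], e.g. a volume fraction.
coarse = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
unlimited = prolong(coarse, limit=False)
limited = prolong(coarse, limit=True)
```

On this step profile the unlimited cubic dips below 0 and rises above 1 next to the jump, while the clamped version stays within the bounds of the coarse data. A clamp of this kind also flattens genuine smooth extrema, which is one reason a scheme such as LENO uses a more careful limiting strategy.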

Research Organization:
Univ. of Utah, Salt Lake City, UT (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA); National Science Foundation (NSF)
Grant/Contract Number:
NA0002375; 1337145
OSTI ID:
1637354
Journal Information:
Concurrency and Computation. Practice and Experience, Vol. 31, Issue 20; Conference: Society of Architectural Historians (SAH) 2019, Providence, RI (United States), 24-28 Apr 2019; ISSN 1532-0626
Publisher:
Wiley
Country of Publication:
United States
Language:
English

