Title: RDPM: An Extensible Tool for Resilience Design Patterns Modelling

Conference

Resilience to faults, errors, and failures in extreme-scale high-performance computing (HPC) systems is a critical challenge. Resilience design patterns offer a new, structured hardware and software design approach for improving resilience. Prior work focused on developing performance, reliability, and availability models for resilience design patterns. This paper extends that work with a Resilience Design Patterns Modeling (RDPM) tool that (1) explores the performance, reliability, and availability of each resilience design pattern, (2) offers customization of parameters to optimize performance, reliability, and availability, and (3) allows investigation of trade-off models for combining multiple patterns into practical resilience solutions.

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1872868
Resource Relation:
Journal Volume: 13098; Conference: 14th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids - Lisbon, Portugal - 8/30/2021-9/3/2021
Country of Publication:
United States
Language:
English
