Orchestrating Fault Prediction with Live Migration and Checkpointing

Behera, Subhendu; Wan, Lipeng; Mueller, Frank; Wolf, Matthew D.; Klasky, Scott A.

Conference · June 1, 2020

OSTI ID:1648858

Behera, Subhendu ^[1]; Wan, Lipeng ^[2]; Mueller, Frank ^[1]; Wolf, Matthew D. ^[2]; Klasky, Scott A. ^[2]

[1] North Carolina State University (NCSU), Raleigh
[2] Oak Ridge National Laboratory (ORNL)

Checkpoint/Restart (C/R) is widely used to provide fault tolerance on High-Performance Computing (HPC) systems. However, Parallel File System (PFS) overhead and failure uncertainty cause significant application overhead. This paper develops an adaptive multi-level C/R model that incorporates a failure prediction and analysis model, which orchestrates failure prediction, checkpointing, checkpoint frequency, and proactive live migration along with the additional benefit of Burst Buffers (BB). It effectively reduces the overheads due to failures, checkpointing, and recovery. Simulation results for the Summit supercomputer yield a reduction of ~20%-86% in application overhead due to BBs, orchestrated failure prediction, and migration. We also observe a ~29% decrease in checkpoint writes to BBs, which can increase the longevity of the BB storage devices.

Research Organization: Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE Office of Science (SC)

DOE Contract Number: AC05-00OR22725

OSTI ID: 1648858

Resource Relation: Conference: International Symposium on High-Performance Parallel and Distributed Computing (HPDC '20), Stockholm, Sweden, June 23-26, 2020

Country of Publication: United States

Language: English

Similar Records

Proactive Fault Tolerance for HPC with Xen Virtualization

Conference · January 1, 2007 · OSTI ID:1648858

Nagarajan, Arun Babu; Mueller, Frank; Engelmann, Christian; +1 more

McrEngine: A Scalable Checkpointing System Using Data-Aware Aggregation and Compression

Journal Article · January 1, 2013 · Scientific Programming · OSTI ID:1648858

Islam, Tanzima Zerin; Mohror, Kathryn; Bagchi, Saurabh; +3 more

A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance

Conference · January 1, 2007 · OSTI ID:1648858

Wang, Chao; Mueller, Frank; Engelmann, Christian; +1 more
