A checkpoint compression study for high-performance computing systems
- Univ. of New Mexico, Albuquerque, NM (United States). Dept. of Computer Science
- Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States). Scalable System Software Dept.
As high-performance computing systems continue to increase in size and complexity, higher failure rates and increased overheads for checkpoint/restart (CR) protocols have raised concerns about the practical viability of CR protocols for future systems. Previously, compression has proven to be a viable approach for reducing checkpoint data volumes and, thereby, reducing CR protocol overhead, leading to improved application performance. In this article, we further explore compression-based CR optimization by examining its baseline performance and scaling properties, evaluating whether improved compression algorithms might lead to even better application performance, and comparing checkpoint compression against and alongside other software- and hardware-based optimizations. Highlights of our results are: (1) compression is a very viable CR optimization; (2) generic, text-based compression algorithms appear to perform near optimally for checkpoint data compression, and faster compression algorithms will not lead to better application performance; (3) compression-based optimizations fare well against and alongside other software-based optimizations; and (4) while hardware-based optimizations outperform software-based ones, they are not as cost effective.
- Research Organization: Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
- Sponsoring Organization: USDOE National Nuclear Security Administration (NNSA)
- DOE Contract Number: AC04-94AL85000
- OSTI ID: 1426906
- Report Number(s): SAND2014-15140J; 534304
- Journal Information: International Journal of High Performance Computing Applications, Vol. 29, Issue 4; ISSN 1094-3420
- Publisher: SAGE
- Country of Publication: United States
- Language: English