Title: A checkpoint compression study for high-performance computing systems

Journal Article · International Journal of High Performance Computing Applications
 [1];  [2];  [1]
  1. Univ. of New Mexico, Albuquerque, NM (United States). Dept. of Computer Science
  2. Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States). Scalable System Software Dept.

As high-performance computing systems continue to grow in size and complexity, higher failure rates and increased overheads for checkpoint/restart (CR) protocols have raised concerns about the practical viability of CR protocols for future systems. Previously, compression has proven to be a viable approach for reducing checkpoint data volumes and, thereby, reducing CR protocol overhead, leading to improved application performance. In this article, we further explore compression-based CR optimization by examining its baseline performance and scaling properties, evaluating whether improved compression algorithms might lead to even better application performance, and comparing checkpoint compression against and alongside other software- and hardware-based optimizations. The highlights of our results are: (1) compression is a very viable CR optimization; (2) generic, text-based compression algorithms appear to perform near optimally for checkpoint data compression, and faster compression algorithms will not lead to better application performance; (3) compression-based optimizations fare well against and alongside other software-based optimizations; and (4) while hardware-based optimizations outperform software-based ones, they are not as cost effective.
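The trade-off behind the abstract's claim that compression reduces CR overhead can be illustrated with a simple back-of-the-envelope model. The sketch below is not the authors' model: it assumes that compression and writing happen one after the other, that `compression_factor` is the fraction of bytes removed, and that the checkpoint interval follows Young's first-order approximation. All function and parameter names are hypothetical.

```python
import math


def commit_time(size_gb, write_gbps, compress_gbps=None, compression_factor=0.0):
    """Seconds to commit one checkpoint of size_gb gigabytes.

    Assumption: compression runs to completion before the write starts,
    and compression_factor is the fraction of bytes removed (0.8 means
    the compressed checkpoint is 20% of the original size).
    """
    if compress_gbps is None:
        # No compression: the cost is just the write.
        return size_gb / write_gbps
    compress_s = size_gb / compress_gbps
    write_s = size_gb * (1.0 - compression_factor) / write_gbps
    return compress_s + write_s


def compression_pays_off(write_gbps, compress_gbps, compression_factor):
    """True when compressing then writing beats writing uncompressed.

    Derived from size/c + size*(1 - f)/w < size/w, which simplifies to
    c > w / f for compression speed c, write bandwidth w, and factor f.
    """
    return compression_factor > 0 and compress_gbps > write_gbps / compression_factor


def young_interval(commit_s, mtbf_s):
    """Young's first-order estimate of the checkpoint interval:
    sqrt(2 * commit time * mean time between failures)."""
    return math.sqrt(2.0 * commit_s * mtbf_s)


if __name__ == "__main__":
    plain = commit_time(10, write_gbps=1)
    packed = commit_time(10, write_gbps=1, compress_gbps=5, compression_factor=0.8)
    print(f"uncompressed commit: {plain:.1f} s, compressed commit: {packed:.1f} s")
    print("compression pays off:", compression_pays_off(1, 5, 0.8))
```

Under these assumptions, compression helps whenever the compressor is faster than the write bandwidth divided by the compression factor. Once that threshold is passed, the write of the compressed data makes up a growing share of the commit time, so a faster compressor yields diminishing returns.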

Research Organization:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA)
DOE Contract Number:
AC04-94AL85000
OSTI ID:
1426906
Report Number(s):
SAND2014-15140J; 534304
Journal Information:
International Journal of High Performance Computing Applications, Vol. 29, Issue 4; ISSN 1094-3420
Publisher:
SAGE
Country of Publication:
United States
Language:
English

