DOI: 10.1145/2465813.2465821
research-article

When is multi-version checkpointing needed?

Published: 18 June 2013

Abstract

The scaling of semiconductor technology and increasing power concerns, combined with system scale, make fault management a growing concern in high-performance computing systems. A greater variety of errors, higher error rates, longer detection intervals, and "silent" errors are all expected. Traditional checkpointing models and systems assume that error detection is nearly immediate and thus that preserving a single checkpoint is sufficient for resilience. We define a richer model for future systems that captures the reality of latent errors, i.e., errors that go undetected for some time, and use it to derive optimal checkpoint intervals for systems with latent errors. With that model, we explore the importance of multi-version checkpoint systems. Our results highlight the limits of single-checkpoint systems, showing that two to more than a dozen checkpoints may be needed to achieve acceptable error coverage. Further, to achieve reasonable system efficiency, multiple versions (two to seventeen) may be needed. We study several specific exascale machine scenarios; the results show that two checkpoints are always beneficial and that, when checkpoint overheads are reduced, as many as three checkpoints are beneficial.
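
To make the single-checkpoint assumption concrete, the following minimal sketch computes the classical first-order checkpoint interval attributed to Young, sqrt(2 * C * M) for checkpoint cost C and mean time between failures M, and then counts how many checkpoint versions must be retained so that one taken before the error still exists when the error is detected. The retention rule is a simplified illustration under stated assumptions (periodic checkpoints, oldest version discarded first, a hard upper bound on detection latency); it is not the model derived in the paper, and the function names are hypothetical.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order optimum checkpoint interval: sqrt(2 * C * M).

    Both arguments must use the same time unit (e.g. seconds).
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def versions_needed(max_latency: float, interval: float) -> int:
    """Number of checkpoint versions to retain for guaranteed recovery.

    ASSUMPTIONS (illustrative only, not the paper's model):
      * a checkpoint is written every `interval` time units,
      * the oldest version is discarded when a new one is written,
      * every error is detected at most `max_latency` after it occurs.

    An error strikes less than one interval after the last clean
    checkpoint, so fewer than 1 + max_latency / interval intervals elapse
    before detection. Retaining ceil(max_latency / interval) + 1 versions
    keeps that clean checkpoint available.
    """
    if max_latency < 0 or interval <= 0:
        raise ValueError("max_latency must be >= 0 and interval > 0")
    return math.ceil(max_latency / interval) + 1


if __name__ == "__main__":
    # Hypothetical numbers: 60 s checkpoint cost, 24 h MTBF, errors detected
    # within 2 h of occurring.
    interval = young_interval(60.0, 24 * 3600.0)
    print(f"interval: {interval:.0f} s")
    print(f"versions: {versions_needed(2 * 3600.0, interval)}")
```

With immediate detection (`max_latency = 0`) the rule returns one version, which is the traditional assumption; any nonzero detection latency pushes the count to two or more.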



Published In

FTXS '13: Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale
June 2013
64 pages
ISBN:9781450319836
DOI:10.1145/2465813
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. checkpointing
  2. error recovery
  3. high-performance computing
  4. reliability

Qualifiers

  • Research-article

Conference

HPDC'13

Acceptance Rates

FTXS '13 Paper Acceptance Rate 7 of 10 submissions, 70%;
Overall Acceptance Rate 16 of 25 submissions, 64%

Cited By

  • (2024) A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly? Future Generation Computer Systems 161 (315-328). DOI: 10.1016/j.future.2024.07.022. Online publication date: Dec-2024
  • (2023) Design and Implementation of Burst Buffer Over-Subscription Scheme for HPC Storage Systems. IEEE Access 11 (3386-3401). DOI: 10.1109/ACCESS.2022.3233829. Online publication date: 2023
  • (2023) Response of HPC hardware to neutron radiation at the dawn of exascale. The Journal of Supercomputing 79:12 (13817-13838). DOI: 10.1007/s11227-023-05199-y. Online publication date: 30-Mar-2023
  • (2022) Checkpointing à la Young/Daly: An Overview. Proceedings of the 2022 Fourteenth International Conference on Contemporary Computing (701-710). DOI: 10.1145/3549206.3549328. Online publication date: 4-Aug-2022
  • (2019) Application health monitoring for extreme‐scale resiliency using cooperative fault management. Concurrency and Computation: Practice and Experience 32:2. DOI: 10.1002/cpe.5449. Online publication date: 25-Jul-2019
  • (2018) ABFR. Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing (27-39). DOI: 10.1145/3208040.3208046. Online publication date: 11-Jun-2018
  • (2018) Validation of a dynamic checkpoint mechanism for Apache Hadoop with failure scenarios. 2018 IEEE 19th Latin-American Test Symposium (LATS) (1-6). DOI: 10.1109/LATW.2018.8347240. Online publication date: Mar-2018
  • (2018) Quantification, Trade-off Analysis, and Optimal Checkpoint Placement for Reliability and Availability. 2018 IEEE 25th International Conference on High Performance Computing (HiPC) (183-192). DOI: 10.1109/HiPC.2018.00029. Online publication date: Dec-2018
  • (2017) Exploring versioned distributed arrays for resilience in scientific applications. International Journal of High Performance Computing Applications 31:6 (564-590). DOI: 10.1177/1094342016664796. Online publication date: 1-Nov-2017
  • (2017) Efficient checkpoint/verification patterns. International Journal of High Performance Computing Applications 31:1 (52-65). DOI: 10.1177/1094342015594531. Online publication date: 1-Jan-2017
