When is multi-version checkpointing needed?

Authors:

Andrew A. Chien

FTXS '13: Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale

Pages 49 - 56

https://doi.org/10.1145/2465813.2465821

Published: 18 June 2013

Abstract

The scaling of semiconductor technology and increasing power concerns combined with system scale make fault management a growing concern in high performance computing systems. Greater variety of errors, higher error rates, longer detection intervals, and "silent" errors are all expected. Traditional checkpointing models and systems assume that error detection is nearly immediate and thus preserving a single checkpoint is sufficient for resilience. We define a richer model for future systems that captures the reality of latent errors, i.e. errors that go undetected for some time, and use it to derive optimal checkpoint intervals for systems with latent errors. With that model, we explore the importance of multi-version checkpoint systems. Our results highlight the limits of single checkpoint systems, showing that two to more than a dozen checkpoints may be needed to achieve acceptable error coverage. Further, to achieve reasonable system efficiency, multiple versions (two to seventeen) may be needed. We study several specific exascale machine scenarios, and the results show that two checkpoints are always beneficial, but when checkpoint overheads are reduced, as many as three checkpoints are beneficial.
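To make the trade-off in the abstract concrete, the following is a minimal sketch, not the model derived in the paper. It combines Young's classical first-order approximation of the optimum checkpoint interval, sqrt(2 × checkpoint cost × mean time between failures), with a toy coverage estimate. The toy estimate rests on assumptions introduced here for illustration only: an error strikes at a uniformly random point within a checkpoint interval, detection latency is exponentially distributed, and the system retains the k most recent checkpoints. All function and parameter names are hypothetical.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order optimum checkpoint interval: sqrt(2 * C * M).

    checkpoint_cost and mtbf must use the same time unit.
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def coverage(k, interval, mean_latency):
    """Toy estimate (assumption, not the paper's model) of error coverage.

    An error occurs at a uniformly random offset u in [0, interval) after a
    clean checkpoint and is detected after a latency d ~ Exp(mean_latency).
    With the k most recent checkpoints retained, the clean checkpoint is
    still available iff u + d < k * interval. Integrating over u gives the
    closed form below.
    """
    ratio = interval / mean_latency
    return 1.0 - (math.exp(-(k - 1) * ratio) - math.exp(-k * ratio)) / ratio


def versions_needed(target, interval, mean_latency, max_k=1000):
    """Smallest k whose toy coverage reaches target, or None if none does."""
    for k in range(1, max_k + 1):
        if coverage(k, interval, mean_latency) >= target:
            return k
    return None


if __name__ == "__main__":
    tau = young_interval(checkpoint_cost=50.0, mtbf=10_000.0)
    for latency in (100.0, 1_000.0, 5_000.0):
        k = versions_needed(0.95, tau, latency)
        print(f"mean latency {latency:>7.0f}: keep {k} checkpoint(s)")
```

Under these assumptions, a single checkpoint suffices only when detection latency is short relative to the checkpoint interval; as the mean latency grows, the number of retained versions needed for a fixed coverage target grows with it, which is the qualitative effect the abstract describes.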

Index Terms

When is multi-version checkpointing needed?
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
2. Computing methodologies
  1. Modeling and simulation
    1. Model development and analysis
      1. Modeling methodologies

Published In

FTXS '13: Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale

June 2013

64 pages

ISBN:9781450319836

DOI:10.1145/2465813

Program Chairs:
Nathan DeBardeleben (Los Alamos National Laboratory, USA),
Jon Stearley (Sandia National Laboratory, USA),
Franck Cappello (INRIA and University of Illinois at Urbana Champaign, France and USA)

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

University of Arizona
SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

HPDC'13: The 22nd International Symposium on High-Performance Parallel and Distributed Computing

June 18, 2013

New York, New York, USA

Acceptance Rates

FTXS '13 Paper Acceptance Rate 7 of 10 submissions, 70%;

Overall Acceptance Rate 16 of 25 submissions, 64%

Article Metrics

Total Citations: 39
Total Downloads: 222
Downloads (last 12 months): 6
Downloads (last 6 weeks): 0

Reflects downloads up to 30 Jan 2025

Citations

Cited By

Bautista-Gomez L, Benoit A, Di S, Herault T, Robert Y, Sun H (2024). A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly? Future Generation Computer Systems 161, 315-328. Online publication date: Dec-2024.
https://doi.org/10.1016/j.future.2024.07.022
Bang J, Sim A, Lockwood G, Eom H, Sung H (2023). Design and Implementation of Burst Buffer Over-Subscription Scheme for HPC Storage Systems. IEEE Access 11, 3386-3401. Online publication date: 2023.
https://doi.org/10.1109/ACCESS.2022.3233829
Bustos A, Rubio-Montero A, Méndez R, Rivera S, González F, Campo X, Asorey H, Mayo-García R (2023). Response of HPC hardware to neutron radiation at the dawn of exascale. The Journal of Supercomputing 79:12, 13817-13838. Online publication date: 30-Mar-2023.
https://doi.org/10.1007/s11227-023-05199-y
Benoit A, Du Y, Herault T, Marchal L, Pallez G, Perotin L, Robert Y, Sun H, Vivien F, Sahni S, Saxena V, Iyengar S (2022). Checkpointing à la Young/Daly: An Overview. Proceedings of the 2022 Fourteenth International Conference on Contemporary Computing, 701-710. Online publication date: 4-Aug-2022.
https://dl.acm.org/doi/10.1145/3549206.3549328
Agarwal P, Naughton T, Park B, Bernholdt D, Hursey J, Geist A (2019). Application health monitoring for extreme‐scale resiliency using cooperative fault management. Concurrency and Computation: Practice and Experience 32:2. Online publication date: 25-Jul-2019.
https://doi.org/10.1002/cpe.5449
Fang A, Chien A, Zhao M, Chandra A, Ramakrishnan L (2018). ABFR. Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, 27-39. Online publication date: 11-Jun-2018.
https://dl.acm.org/doi/10.1145/3208040.3208046
Cardoso P, Barcelos P (2018). Validation of a dynamic checkpoint mechanism for Apache Hadoop with failure scenarios. 2018 IEEE 19th Latin-American Test Symposium (LATS), 1-6. Online publication date: Mar-2018.
https://doi.org/10.1109/LATW.2018.8347240
Subasi O, Tipireddy R, Krishnamoorthy S (2018). Quantification, Trade-off Analysis, and Optimal Checkpoint Placement for Reliability and Availability. 2018 IEEE 25th International Conference on High Performance Computing (HiPC), 183-192. Online publication date: Dec-2018.
https://doi.org/10.1109/HiPC.2018.00029
Chien A, Balaji P, Dun N, Fang A, Fujita H, Iskra K, Rubenstein Z, Zheng Z, Hammond J, Laguna I, Richards D, Dubey A, van Straalen B, Hoemmen M, Heroux M, Teranishi K, Siegel A (2017). Exploring versioned distributed arrays for resilience in scientific applications. International Journal of High Performance Computing Applications 31:6, 564-590. Online publication date: 1-Nov-2017.
https://dl.acm.org/doi/10.1177/1094342016664796
Benoit A, Raina S, Robert Y (2017). Efficient checkpoint/verification patterns. International Journal of High Performance Computing Applications 31:1, 52-65. Online publication date: 1-Jan-2017.
https://dl.acm.org/doi/10.1177/1094342015594531