On the modelling of optimal coordinated checkpoint period in supercomputers

Moríñigo, José A.; Rodríguez-Pascual, Manuel; Mayo-García, Rafael

doi:10.1007/s11227-018-2621-1

Published: 22 September 2018

Volume 75, pages 930–954, (2019)

The Journal of Supercomputing

José A. Moríñigo ORCID: orcid.org/0000-0003-2528-7485¹,
Manuel Rodríguez-Pascual¹ &
Rafael Mayo-García¹

Abstract

This work revisits the assumptions currently adopted in checkpointing models and evaluates their impact on the predicted optimal period for coordinated single-level checkpointing. An accurate a priori estimate of the optimal checkpoint period for a given computing facility is necessary because the period drives the overhead incurred by frequent checkpointing and, as a result, the drop in steady-state resource availability. The study discusses the impact of the order of approximation used in single-level coordinated checkpoint modelling. It then extends previous results on the optimal checkpoint period to explore the effects of the checkpoint rate on cluster performance under total-execution-time and energy-consumption policies, and in terms of resource availability. With current technology, a prescribed checkpoint rate implies a critical cluster size above which the attained availability is too poor for the platform to be cost-effective. Some guidelines for cluster sizing are therefore indicated.
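
As a rough illustration of how a first-order model ties the checkpoint period to cluster size, the sketch below uses Young's classical first-order approximation, T ≈ sqrt(2·C·M), with C the checkpoint cost and M the platform MTBF. It is not the model developed in this article. It assumes independent, exponentially distributed node failures, so that M equals the node MTBF divided by the number of nodes, and a first-order waste estimate C/T + T/(2M) that ignores downtime and restart costs. All function names and numerical values are hypothetical.

```python
import math


def platform_mtbf(node_mtbf_s: float, n_nodes: int) -> float:
    """Platform MTBF, assuming independent exponential node failures."""
    return node_mtbf_s / n_nodes


def young_period(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order optimal checkpoint period: sqrt(2 * C * M)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def first_order_waste(period_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order fraction of time lost: C/T for checkpoints plus T/(2M)
    for expected rework after a failure, clamped to 1."""
    waste = checkpoint_cost_s / period_s + period_s / (2.0 * mtbf_s)
    return min(1.0, waste)


if __name__ == "__main__":
    node_mtbf = 10 * 365 * 24 * 3600.0  # assumed value: 10 years per node
    cost = 600.0                        # assumed value: 10 minutes per checkpoint
    for n in (1_000, 10_000, 100_000, 1_000_000):
        m = platform_mtbf(node_mtbf, n)
        t = young_period(cost, m)
        w = first_order_waste(t, cost, m)
        print(f"{n:>9d} nodes: period {t:8.1f} s, waste {w:6.1%}")
```

Under these assumptions the waste at Young's period equals sqrt(2C/M), so it grows with the square root of the node count. This is only the qualitative effect behind a critical cluster size; the first-order estimate loses accuracy once the period approaches the platform MTBF.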

Acknowledgements

This work was supported by the COST Action NESUS (IC1305) and partially funded by the Spanish Ministry of Economy and Competitiveness Project CODEC2 (TIN2015-63562-R) with FEDER funds, the RICAP Network (517RT0529) with CYTED funds, and EU H2020 Project HPC4E (Grant Agreement No. 689772).

Author information

Authors and Affiliations

Department of Technology, CIEMAT, Avda. Complutense 40, 28840, Madrid, Spain
José A. Moríñigo, Manuel Rodríguez-Pascual & Rafael Mayo-García

Corresponding author

Correspondence to José A. Moríñigo.

About this article

Cite this article

Moríñigo, J.A., Rodríguez-Pascual, M. & Mayo-García, R. On the modelling of optimal coordinated checkpoint period in supercomputers. J Supercomput 75, 930–954 (2019). https://doi.org/10.1007/s11227-018-2621-1

Published: 22 September 2018
Issue Date: 06 February 2019
DOI: https://doi.org/10.1007/s11227-018-2621-1
