On the modelling of optimal coordinated checkpoint period in supercomputers

The Journal of Supercomputing

Abstract

This work revisits the assumptions currently adopted in checkpoint modelling and evaluates their impact on the predicted optimal period for coordinated single-level checkpointing. An accurate a priori estimate of the optimal checkpoint period for a given computing facility is necessary because the period determines the overhead incurred by frequent checkpointing and, in turn, the resulting drop in steady-state resource availability. The study discusses the impact of the order of approximation used in single-level coordinated checkpoint modelling and extends previous results on the optimal checkpoint period to explore how the checkpoint rate affects cluster performance under total-execution-time and energy-consumption policies, as well as resource availability. With current technology, a prescribed checkpoint rate implies a critical cluster size above which the attained availability is too poor for the platform to be cost-effective. Some guidelines for cluster sizing are therefore given.
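As a rough illustration of what "order of approximation" means here, the sketch below compares two classical closed-form estimates of the optimal checkpoint period from the literature: Young's first-order formula and Daly's higher-order formula. It is not the model developed in this article. The function names, the first-order waste estimate, the assumption of independent exponential node failures (so that the system MTBF scales as node MTBF divided by node count), and all numerical values are illustrative assumptions.

```python
import math


def young_period(checkpoint_cost, mtbf):
    """Young's first-order estimate: T = sqrt(2 * C * M)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def daly_period(checkpoint_cost, mtbf):
    """Daly's higher-order estimate of the compute time between checkpoints."""
    if checkpoint_cost >= 2.0 * mtbf:
        return mtbf
    x = checkpoint_cost / (2.0 * mtbf)
    base = math.sqrt(2.0 * checkpoint_cost * mtbf)
    return base * (1.0 + math.sqrt(x) / 3.0 + x / 9.0) - checkpoint_cost


def system_mtbf(node_mtbf, n_nodes):
    """Assumption: independent, exponentially distributed node failures."""
    return node_mtbf / n_nodes


def first_order_waste(period, checkpoint_cost, mtbf):
    """Fraction of time lost: checkpoint overhead C/T plus expected rework T/(2M).

    Only meaningful while the period is much shorter than the MTBF.
    """
    return checkpoint_cost / period + period / (2.0 * mtbf)


if __name__ == "__main__":
    node_mtbf = 5.0 * 365.0 * 86400.0  # hypothetical: 5 years per node, in seconds
    checkpoint_cost = 600.0            # hypothetical: 10 minutes per checkpoint

    for n_nodes in (1_000, 10_000, 100_000):
        m = system_mtbf(node_mtbf, n_nodes)
        t_young = young_period(checkpoint_cost, m)
        t_daly = daly_period(checkpoint_cost, m)
        # Crude availability estimate, clipped at zero.
        avail = max(0.0, 1.0 - first_order_waste(t_young, checkpoint_cost, m))
        print(f"{n_nodes:>7} nodes: MTBF={m:9.0f} s  "
              f"Young={t_young:8.0f} s  Daly={t_daly:8.0f} s  "
              f"availability~{avail:.2f}")
```

Under these assumptions the two estimates stay close while the checkpoint cost is small relative to the system MTBF and drift apart as the node count grows, and the crude availability estimate falls with cluster size, which is the qualitative effect behind a critical cluster size.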

Acknowledgements

This work was supported by the COST Action NESUS (IC1305) and partially funded by the Spanish Ministry of Economy and Competitiveness Project CODEC2 (TIN2015-63562-R) with FEDER funds, the RICAP Network (517RT0529) with CYTED funds, and EU H2020 Project HPC4E (Grant Agreement No. 689772).

Author information

Corresponding author

Correspondence to José A. Moríñigo.

About this article

Cite this article

Moríñigo, J.A., Rodríguez-Pascual, M. & Mayo-García, R. On the modelling of optimal coordinated checkpoint period in supercomputers. J Supercomput 75, 930–954 (2019). https://doi.org/10.1007/s11227-018-2621-1

  • DOI: https://doi.org/10.1007/s11227-018-2621-1
