Optimizing the fault-tolerance overheads of HPC systems using prediction and multiple proactive actions

The Journal of Supercomputing

Abstract

The complexity and scale of high-performance computing systems are increasing rapidly, which makes fault tolerance a critical challenge. In this paper, we consider the impact of multiple proactive actions on proactive fault tolerance and periodic checkpointing. We extend Aupy’s model to account for multiple proactive actions, including proactive checkpointing and task migration. We then propose optimal strategies for deciding when to trust predictions, and provide algorithms for computing the optimal interval between periodic checkpoints. The results show that the proposed method can significantly improve system productivity. Our case study indicates that the recall of the predictor matters more on small platforms, and that precision becomes increasingly important as the scale of the system grows.
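As background for the quantities named in the abstract, the following is a minimal sketch and not the model proposed in this paper. It uses Young's classical first-order approximation for the periodic checkpoint interval, sqrt(2 * C * mu), where C is the checkpoint cost and mu is the platform mean time between failures, together with the standard definitions of predictor precision and recall. The function names and the helper `interval_with_recall` are hypothetical; the latter simply applies Young's formula to the unpredicted failures only, ignoring the cost of false predictions and of the proactive actions themselves.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order checkpoint interval: sqrt(2 * C * mu)."""
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Standard predictor metrics.

    precision = correct predictions / all predictions
    recall    = predicted failures  / all failures
    """
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


def interval_with_recall(checkpoint_cost: float, mtbf: float, recall: float) -> float:
    """Illustrative simplification (an assumption, not the paper's result).

    If a fraction `recall` of failures is predicted and handled proactively,
    the mean time between *unpredicted* failures is mtbf / (1 - recall).
    Young's formula is then applied to that longer effective MTBF.
    """
    if not 0.0 <= recall < 1.0:
        raise ValueError("recall must be in [0, 1)")
    return young_interval(checkpoint_cost, mtbf / (1.0 - recall))


if __name__ == "__main__":
    # Hypothetical numbers: 10-minute checkpoints, 1-day platform MTBF.
    print(young_interval(600.0, 86400.0))
    print(interval_with_recall(600.0, 86400.0, 0.75))
```

Under this simplification, a higher recall lengthens the interval between periodic checkpoints because fewer failures strike without warning. Precision does not appear at all, which is one reason a fuller model that charges for false alarms is needed.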

Notes

  1. The results obtained with the Weibull distribution can be found at http://good.gd/3262191.htm.

References

  1. Service RF (2012) What it’ll take to go exascale. Science 335(6067):394–396

  2. Sato K, Moody A, Mohror K, Gamblin T et al (2012) Design and modeling of a non-blocking checkpointing system. In: The 2012 international conference for high performance computing, networking, storage and analysis (SC’12), article no. 19

  3. Cappello F (2009) Fault tolerance in petascale/exascale systems: current knowledge, challenges and research opportunities. Int J High Perform Comput Appl 23:212–226

  4. Li Y, Lan Z, Gujrati P, Sun X (2009) Fault-aware runtime strategies for high-performance computing. IEEE Trans Parallel Distrib Syst 20(4):460–473

  5. Varela MR, Ferreira KB, Riesen R (2010) Fault-tolerance for exascale systems. In: 2010 IEEE international conference on cluster computing workshops and posters (cluster workshops), pp 1–4

  6. Leonardo BG, Seiji T, Komatitsch D, Cappello F, Maruyama N et al (2011) FTI: high performance fault tolerance interface for hybrid systems. In: Proceedings of 2011 international conference for high performance computing, networking, storage and analysis, pp 1–32

  7. Moody A, Bronevetsky G, Mohror K, Bronis R et al (2010) Design, modeling, and evaluation of a scalable multi-level checkpointing system. In: Proceedings of the 2010 ACM/IEEE international conference for high performance computing, networking, storage and analysis (SC’10), pp 1–11

  8. Jangjaimon I, Tzeng NF (2013) Adaptive incremental checkpointing via delta compression for networked multicore systems. In: The 27th IEEE international symposium on parallel & distributed processing (IPDPS 2013), pp 7–18

  9. Mohamed SB, Gainaru A, Leonardo BG, Cappello F et al (2013) Improving the computing efficiency of HPC systems using a combination of proactive and preventive checkpointing. In: The 27th IEEE international symposium on parallel & distributed processing (IPDPS 2013), pp 501–512

  10. Gainaru A, Cappello F, Kramer W (2012) Taming of the shrew: modeling the normal and faulty behavior of large-scale HPC systems. In: IEEE 26th international parallel & distributed processing symposium (IPDPS), pp 1168–1179

  11. Gainaru A, Cappello F, Kramer W, Snir M (2012) Fault prediction under the microscope—a closer look into HPC systems. In: The 2012 international conference for high performance computing, networking, storage and analysis (SC’12), article no. 77

  12. Yu L, Zheng Z, Lan Z, Coghlan S (2011) Practical online failure prediction for BlueGene/P: period-based vs event-driven. In: Dependable systems and networks workshops (DSN-W), pp 259–264

  13. Aupy G, Robert Y, Vivien F, Zaidouni D (2014) Checkpointing algorithms and fault prediction. J Parallel Distrib Comput 74(2):2048–2064

  14. Cappello F, Geist A, Gropp B, Kale L et al (2009) Toward exascale resilience. Int J High Perform Comput Appl 23(4):374–388

  15. Ifeanyi P, Egwutuoha DL, Bran S, Shiping C (2013) A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems. J Supercomput 65(3):1302–1326

  16. Wang C, Mueller F, Engelmann C, Stephen LS (2012) Proactive process-level live migration and back migration in HPC environments. J Parallel Distrib Comput 72(2):254–267

  17. Nagarajan A, Mueller F, Engelmann C, Scott S (2007) Proactive fault tolerance for HPC with Xen virtualization. In: Proceedings of the international conference on supercomputing, pp 23–32

  18. Zheng Z, Lan Z, Gupta R, Coghlan S, Beckman P (2010) A practical failure prediction with location and lead time for BlueGene/P. In: Dependable systems and networks workshops (DSN-W 2010), pp 15–22

  19. Daly J (2003) A model for predicting the optimum checkpoint interval for restart dumps. In: International conference of computational science (ICCS 2003), pp 3–12

  20. Mitzenmacher M, Upfal E (2005) Probability and computing: randomized algorithms and probabilistic analysis. Cambridge University Press, Cambridge

  21. Schroeder B, Gibson G (2006) A large scale study of failures in high-performance-computing systems. In: The 2006 international symposium on dependable systems and networks, pp 337–350

  22. Smith R, Dietrich D (1994) The bathtub curve: an alternative explanation. In: Proceedings of the reliability and maintainability symposium, pp 241–247

  23. Tang W (2010) The ANL intrepid log (online). http://www.cs.huji.ac.il/labs/parallel/workload/l_anl_int/index.html

  24. Tang GW, Lan Z, Desai N, Buettner D, Yu Y (2011) Reducing fragmentation on torus-connected supercomputers. In: Proceedings of the IEEE international parallel & distributed processing symposium, pp 828–839

  25. Lan Z, Gu JX, Zheng ZM, Thakur R et al (2008) Dynamic meta-learning for failure prediction in large-scale systems: a case study. In: Proceedings of the international conference on parallel processing, pp 157–164

  26. Nakka N, Agrawal A, Choudhary A (2011) Predicting node failure in high performance computing systems from failure and usage logs. In: IEEE workshop on dependable parallel, distributed and network-centric systems, pp 1557–1566

  27. Gainaru A, Cappello F, Snir M, Kramer W (2012) Fault prediction under the microscope: a closer look into HPC systems. In: Proceedings of the international conference on high performance computing (SC’12), article no. 77

  28. Hargrove P, Duell J (2006) Berkeley lab checkpoint/restart (BLCR) for Linux clusters. In: Proceedings of the scientific discovery through advanced computing (SciDAC), pp 494–499

  29. Sudakov OO, Meshcheriakov IS, Boyko YV (2007) CHPOX: transparent checkpointing system for Linux clusters. In: IEEE international workshop on intelligent data acquisition and advanced computing systems: technology and applications, pp 159–164

  30. Fagg GE, Dongarra J (2000) FT-MPI: fault tolerant MPI, supporting dynamic applications in a dynamic world. Recent Adv Parallel Virtual Mach Message Passing Interface 346–353

  31. Daly J (2006) A higher order estimate of the optimum checkpoint interval for restart dumps. Future Gener Comput Syst 22:303–312

  32. Gomez LB, Nukada A, Maruyama N, Cappello F, Matsuoka S (2010) Low-overhead diskless checkpoint for hybrid computing systems. In: The 2010 international conference on high performance computing (HiPC), pp 1–10

  33. Litzkow M, Tannenbaum T, Basney J, Livny M (1997) Checkpoint and migration of UNIX processes in the condor distributed processing system. In: University of Wisconsin-Madison computer science technical report, no. 1346

  34. Cores I, Rodríguez G, Martín MJ, González P (2014) In-memory application-level checkpoint-based migration for MPI programs. J Supercomput. doi:10.1007/3-540-44864-0_1

  35. Chakravorty S, Mendes C, Kale L (2006) Proactive fault tolerance in MPI applications via task migration. In: Proceedings of the international conference on high performance computing (HiPC’06), pp 485–496

Acknowledgments

The authors thank Tang from the Illinois Institute of Technology for making the log file of Intrepid available. We would like to thank the reviewers for their comments and suggestions, which greatly helped improve this paper.

Author information

Corresponding author

Correspondence to Lei Zhu.

About this article

Cite this article

Zhu, L., Gu, J., Wang, Y. et al. Optimizing the fault-tolerance overheads of HPC systems using prediction and multiple proactive actions. J Supercomput 71, 3668–3694 (2015). https://doi.org/10.1007/s11227-015-1458-0

  • DOI: https://doi.org/10.1007/s11227-015-1458-0
