Abstract
The complexity and scale of high-performance computing systems are increasing rapidly, making fault tolerance a critical challenge. In this paper, we consider the impact of multiple proactive actions on proactive fault tolerance and periodic checkpointing. We extend Aupy’s model to account for multiple proactive actions, including proactive checkpointing and task migration. We then propose optimal strategies for deciding when to trust predictions, and provide algorithms for computing the optimal interval for periodic checkpointing. The results show that the proposed method can significantly improve system productivity. Our case study indicates that the recall of the predictor matters more on small platforms, and that precision becomes increasingly important as the scale of the system increases.
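For orientation, the two quantities the abstract relies on can be sketched as follows. This is only a minimal illustration, not the extended model of this paper: it assumes the classical first-order approximation for the periodic checkpoint interval, sqrt(2 * MTBF * C), and the standard definitions of predictor precision and recall. The function names and the numeric values are hypothetical.

```python
import math


def first_order_checkpoint_interval(mtbf: float, checkpoint_cost: float) -> float:
    """Classical first-order estimate of the periodic checkpoint interval.

    Assumption: this is the well-known sqrt(2 * MTBF * C) approximation,
    used here for illustration only; it ignores predictions and
    proactive actions entirely.
    """
    if mtbf <= 0 or checkpoint_cost <= 0:
        raise ValueError("mtbf and checkpoint_cost must be positive")
    return math.sqrt(2.0 * mtbf * checkpoint_cost)


def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Standard predictor metrics.

    precision = correct predictions / all predictions issued
    recall    = correctly predicted failures / all failures that occurred
    """
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


if __name__ == "__main__":
    # Hypothetical platform: MTBF of 2 hours, 100 s per checkpoint.
    print(first_order_checkpoint_interval(mtbf=7200.0, checkpoint_cost=100.0))
    # Hypothetical predictor: 80 correct alarms, 20 false alarms, 40 missed failures.
    print(precision_recall(true_pos=80, false_pos=20, false_neg=40))
```

In this sketch a higher recall means fewer failures arrive unpredicted, while a higher precision means fewer proactive actions are wasted on false alarms; how the two trade off against platform scale is what the case study examines.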
Notes
The results obtained via the Weibull distribution can be found in http://good.gd/3262191.htm.
References
Robert F (2012) What it’ll take to go exascale. Science 27:394–396
Sato K, Moody A, Mohror K, Gamblin T et al (2012) Design and modeling of a non-blocking checkpointing system. In: The 2012 international conference for high performance computing, networking, storage and analysis (SC’12), article no. 19
Cappello F (2009) Fault tolerance in petascale/exascale systems: current knowledge, challenges and research opportunities. Int J High Perform Comput Appl 23:212–226
Li Y, Lan Z, Gujrati P, Sun X (2009) Fault-aware runtime strategies for high-performance computing. IEEE Trans Parallel Distrib Syst 20(4):460–473
Varela MR, Ferreira KB, Riesen R (2010) Fault-tolerance for exascale systems. In: 2010 IEEE international conference on cluster computing workshops and posters (cluster workshops), pp 1–4
Leonardo BG, Seiji T, Komatitsch D, Cappello F, Maruyama N et al (2011) FTI: high performance fault tolerance interface for hybrid systems. In: Proceedings of 2011 international conference for high performance computing, networking, storage and analysis, pp 1–32
Moody A, Bronevetsky G, Mohror K, Bronis R et al (2010) Design, modeling, and evaluation of a scalable multi-level checkpointing system. In: Proceedings of the 2010 ACM/IEEE international conference for high performance computing, networking, storage and analysis (SC’10), pp 1–11
Jangjaimon I, Tzeng NF (2013) Adaptive incremental checkpointing via delta compression for networked multicore systems. In: The 27th IEEE international symposium on parallel & distributed processing (IPDPS 2013), pp 7–18
Mohamed SB, Gainaru A, Leonardo BG, Cappello F et al (2013) Improving the computing efficiency of HPC systems using a combination of proactive and preventive checkpointing. In: The 27th IEEE international symposium on parallel & distributed processing (IPDPS 2013), pp 501–512
Gainaru A, Cappello F, Kramer W (2012) Taming of the shrew: modeling the normal and faulty behavior of large-scale HPC systems. In: IEEE 26th international parallel & distributed processing symposium (IPDPS), pp 1168–1179
Gainaru A, Cappello F, Kramer W, Snir M (2012) Fault prediction under the microscope—a closer look into HPC systems. In: The 2012 international conference for high performance computing, networking, storage and analysis (SC’12), article no. 77
Yu L, Zheng Z, Lan Z, Coghlan S (2011) Practical online failure prediction for BlueGene/P: period-based vs event-driven. In: Dependable systems and networks workshops (DSN-W), pp 259–264
Aupy G, Robert Y, Vivien F, Zaidouni D (2014) Checkpointing algorithms and fault prediction. J Parallel Distrib Comput 74(2):2048–2064
Cappello F, Geist A, Gropp B, Kale L et al (2009) Toward exascale resilience. Int J High Perform Comput Appl 23(4):374–388
Ifeanyi P, Egwutuoha DL, Bran S, Shiping C (2013) A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems. J Supercomput 65(3):1302–1326
Wang C, Mueller F, Engelmann C, Stephen LS (2012) Proactive process-level live migration and back migration in HPC environments. J Parallel Distrib Comput 72(2):254–267
Nagarajan A, Mueller F, Engelmann C, Scott S (2007) Proactive fault tolerance for HPC with Xen virtualization. In: Proceedings of the international conference on supercomputing, pp 23–32
Zheng Z, Lan Z, Gupta R, Coghlan S, Beckman P (2010) A practical failure prediction with location and lead time for BlueGene/P. In: Dependable systems and networks workshops (DSN-W 2010), pp 15–22
Daly J (2003) A model for predicting the optimum checkpoint interval for restart dumps. In: International conference of computational science (ICCS 2003), pp 3–12
Mitzenmacher M, Upfal E (2005) Probability and computing: randomized algorithms and probabilistic analysis. Cambridge University Press, Cambridge
Schroeder B, Gibson G (2006) A large scale study of failures in high-performance-computing systems. In: The 2006 international symposium on dependable systems and networks, pp 337–350
Smith R, Dietrich D (1994) The bathtub curve: an alternative explanation. In: Proceedings of the reliability and maintainability symposium, pp 241–247
Tang W (2010) The ANL intrepid log (online). http://www.cs.huji.ac.il/labs/parallel/workload/l_anl_int/index.html
Tang GW, Lan Z, Desai N, Buettner D, Yu Y (2011) Reducing fragmentation on torus-connected supercomputers. In: Proceedings of the IEEE international parallel & distributed processing symposium, pp 828–839
Lan Z, Gu JX, Zheng ZM, Thakur R et al (2008) Dynamic meta-learning for failure prediction in large-scale systems: a case study. In: Proceedings of the international conference on parallel processing, pp 157–164
Nakka N, Agrawal A, Choudhary A (2011) Predicting node failure in high performance computing systems from failure and usage logs. In: IEEE workshop on dependable parallel, distributed and network-centric systems, pp 1557–1566
Gainaru A, Cappello F, Snir M, Kramer W (2012) Fault prediction under the microscope: a closer look into HPC systems. In: Proceedings of the international conference on high performance computing (SC’12), article no. 77
Hargrove P, Duell J (2006) Berkeley lab checkpoint/restart (BLCR) for Linux clusters. In: Proceedings of the scientific discovery through advanced computing (SciDAC), pp 494–499
Sudakov OO, Meshcheriakov IS, Boyko YV (2007) CHPOX: transparent checkpointing system for Linux clusters. In: IEEE international workshop on intelligent data acquisition and advanced computing systems: technology and applications, pp 159–164
Fagg GE, Dongarra J (2000) FT-MPI: fault tolerant MPI, supporting dynamic applications in a dynamic world. Recent Adv Parallel Virtual Mach Message Passing Interface 346–353
Daly J (2006) A higher order estimate of the optimum checkpoint interval for restart dumps. Future Gener Comput Syst 22:303–312
Gomez LB, Nukada A, Maruyama N, Cappello F, Matsuoka S (2010) Low-overhead diskless checkpoint for hybrid computing systems. In: The 2010 international conference on high performance computing (HiPC), pp 1–10
Litzkow M, Tannenbaum T, Basney J, Livny M (1997) Checkpoint and migration of UNIX processes in the condor distributed processing system. In: University of Wisconsin-Madison computer science technical report, no. 1346
Cores I, Rodríguez G, Martín MJ, González P (2014) In-memory application-level checkpoint-based migration for MPI programs. J Supercomput. doi:10.1007/3-540-44864-0_1
Chakravorty S, Mendes C, Kale L (2006) Proactive fault tolerance in MPI applications via task migration. In: Proceedings of the international conference on high performance computing (HiPC’06), pp 485–496
Acknowledgments
The authors thank Tang from the Illinois Institute of Technology for making the log file of Intrepid available. We would like to thank the reviewers for their comments and suggestions, which greatly helped improve this paper.
Cite this article
Zhu, L., Gu, J., Wang, Y. et al. Optimizing the fault-tolerance overheads of HPC systems using prediction and multiple proactive actions. J Supercomput 71, 3668–3694 (2015). https://doi.org/10.1007/s11227-015-1458-0
DOI: https://doi.org/10.1007/s11227-015-1458-0