Optimizing the fault-tolerance overheads of HPC systems using prediction and multiple proactive actions

Zhu, Lei; Gu, Jianhua; Wang, Yunlan; Zhao, Tianhai; Cai, Zhennao

doi:10.1007/s11227-015-1458-0

Published: 04 June 2015

The Journal of Supercomputing, volume 71, pages 3668–3694 (2015)

Abstract

The complexity and scale of high-performance computing systems are increasing rapidly, making fault tolerance a critical challenge. In this paper, we consider the impact of multiple proactive actions on proactive fault tolerance and periodic checkpointing. We extend Aupy’s model to the case of multiple proactive actions, including proactive checkpointing and task migration. We then propose optimal strategies for deciding when to trust predictions, and provide algorithms for computing the optimal storage interval for periodic checkpointing. The results show that the proposed method can significantly improve system productivity. Our case study indicates that the recall of the predictor is more important for small platforms, and that precision becomes increasingly important as the scale of the system increases.
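To make the notion of an optimal checkpointing interval concrete, the following is a minimal sketch and not the algorithm proposed in the paper, whose details are not given in the abstract. It assumes the classical first-order approximation of the optimal period, sqrt(2 * C * mu), where C is the checkpoint cost and mu is the platform mean time between failures, and a commonly cited first-order variant for a predictor with recall r whose predictions are always trusted, sqrt(2 * C * mu / (1 - r)). The function names and the numeric values are hypothetical and chosen only for illustration.

```python
import math


def first_order_period(checkpoint_cost: float, mtbf: float) -> float:
    """First-order optimal checkpointing period, sqrt(2 * C * mu).

    checkpoint_cost (C) and mtbf (mu) must use the same time unit.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def period_with_recall(checkpoint_cost: float, mtbf: float, recall: float) -> float:
    """Assumed first-order variant when a fraction `recall` of faults is predicted.

    Predicted faults are handled by a proactive action, so only the
    unpredicted fraction (1 - recall) strikes unprotected work. This is
    modeled here as a longer effective MTBF, mu / (1 - recall).
    """
    if not 0.0 <= recall < 1.0:
        raise ValueError("recall must lie in [0, 1)")
    return first_order_period(checkpoint_cost, mtbf / (1.0 - recall))


def platform_mtbf(node_mtbf: float, nodes: int) -> float:
    """Platform MTBF for `nodes` identical, independent nodes: mu_ind / N."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return node_mtbf / nodes


if __name__ == "__main__":
    # Hypothetical values: 50 s checkpoint, 10^6 s node MTBF, 100 nodes.
    mu = platform_mtbf(1_000_000.0, 100)
    print(first_order_period(50.0, mu))
    print(period_with_recall(50.0, mu, 0.75))
```

Under these assumptions, a higher recall lengthens the interval between periodic checkpoints, because fewer faults arrive unannounced. Precision does not appear in this simplified formula; it affects the cost of acting on false predictions, which this sketch does not model.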

Notes

The results obtained with the Weibull distribution are available at http://good.gd/3262191.htm.

References

Robert F (2012) What it’ll take to go exascale. Science 27:394–396
Sato K, Moody A, Mohror K, Gamblin T et al (2012) Design and modeling of a non-blocking checkpointing system. In: The 2012 international conference for high performance computing, networking, storage and analysis (SC’12), article no. 19
Cappello F (2009) Fault tolerance in petascale/exascale systems: current knowledge, challenges and research opportunities. Int J High Perform Comput Appl 23:212–226
Li Y, Lan Z, Gujrati P, Sun X (2009) Fault-aware runtime strategies for high-performance computing. IEEE Trans Parallel Distrib Syst 20(4):460–473
Varela MR, Ferreira KB, Riesen R (2010) Fault-tolerance for exascale systems. In: 2010 IEEE international conference on cluster computing workshops and posters (cluster workshops), pp 1–4
Leonardo BG, Seiji T, Komatitsch D, Cappello F, Maruyama N et al (2011) FTI: high performance fault tolerance interface for hybrid systems. In: Proceedings of 2011 international conference for high performance computing, networking, storage and analysis, pp 1–32
Moody A, Bronevetsky G, Mohror K, Bronis R et al (2010) Design, modeling, and evaluation of a scalable multi-level checkpointing system. In: Proceedings of the 2010 ACM/IEEE international conference for high performance computing, networking, storage and analysis (SC’10), pp 1–11
Jangjaimon I, Tzeng NF (2013) Adaptive incremental checkpointing via delta compression for networked multicore systems. In: The 27th IEEE international symposium on parallel & distributed processing (IPDPS 2013), pp 7–18
Mohamed SB, Gainaru A, Leonardo BG, Cappello F et al (2013) Improving the computing efficiency of HPC systems using a combination of proactive and preventive checkpointing. In: The 27th IEEE international symposium on parallel & distributed processing (IPDPS 2013), pp 501–512
Gainaru A, Cappello F, Kramer W (2012) Taming of the shrew: modeling the normal and faulty behavior of large-scale hpc systems. In: IEEE 26th international parallel & distributed processing symposium (IPDPS), pp 1168–1179
Gainaru A, Cappello F, Kramer W, Snir M (2012) Fault prediction under the microscope—a closer look into HPC systems. In: The 2012 international conference for high performance computing, networking, storage and analysis (SC’12), article no. 77
Yu L, Zheng Z, Lan Z, Coghlan S (2011) Practical online failure prediction for BlueGene/P: period-based vs event-driven. In: Dependable systems and networks workshops (DSN-W), pp 259–264
Aupy G, Robert Y, Vivien F, Zaidouni D (2014) Checkpointing algorithms and fault prediction. J Parallel Distrib Comput 74(2):2048–2064
Cappello F, Geist A, Gropp B, Kale L et al (2009) Toward exascale resilience. Int J High Perform Comput Appl 23(4):374–388
Ifeanyi P, Egwutuoha DL, Bran S, Shiping C (2013) A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems. J Supercomput 65(3):1302–1326
Wang C, Mueller F, Engelmann C, Stephen LS (2012) Proactive process-level live migration and back migration in HPC environments. J Parallel Distrib Comput 72(2):254–267
Nagarajan A, Mueller F, Engelmann C, Scott S (2007) Proactive fault tolerance for HPC with Xen virtualization. In: Proceedings of the international conference on supercomputing, pp 23–32
Zheng Z, Lan Z, Gupta R, Coghlan S, Beckman P (2010) A practical failure prediction with location and lead time for BlueGene/P. In: Dependable systems and networks workshops (DSN-W 2010), pp 15–22
Daly J (2003) A model for predicting the optimum checkpoint interval for restart dumps. In: International conference of computational science (ICCS 2003), pp 3–12
Mitzenmacher M, Upfal E (2005) Probability and computing: randomized algorithms and probabilistic analysis. Cambridge University Press, Cambridge
Schroeder B, Gibson G (2006) A large scale study of failures in high-performance-computing systems. In: The 2006 international symposium on dependable systems and networks, pp 337–350
Smith R, Dietrich D (1994) The bathtub curve: an alternative explanation. In: Proceedings of the reliability and maintainability symposium, pp 241–247
Tang W (2010) The ANL intrepid log (online). http://www.cs.huji.ac.il/labs/parallel/workload/l_anl_int/index.html
Tang GW, Lan Z, Desai N, Buettner D, Yu Y (2011) Reducing fragmentation on torus-connected supercomputers. In: Proceedings of the IEEE international parallel & distributed processing symposium, pp 828–839
Lan Z, Gu JX, Zheng ZM, Thakur R et al (2008) Dynamic meta-learning for failure prediction in large-scale systems: a case study. In: Proceedings of the international conference on parallel processing, pp 157–164
Nakka N, Agrawal A, Choudhary A (2011) Predicting node failure in high performance computing systems from failure and usage logs. In: IEEE workshop on dependable parallel, distributed and network-centric systems, pp 1557–1566
Gainaru A, Cappello F, Snir M, Kramer W (2012) Fault prediction under the microscope: a closer look into HPC systems. In: Proceedings of the international conference on high performance computing (SC’12), article no. 77
Hargrove P, Duell J (2006) Berkeley lab checkpoint/restart (BLCR) for Linux clusters. In: Proceedings of the scientific discovery through advanced computing (SciDAC), pp 494–499
Sudakov OO, Meshcheriakov IS, Boyko YV (2007) CHPOX: transparent checkpointing system for Linux clusters. In: IEEE international workshop on intelligent data acquisition and advanced computing systems: technology and applications, pp 159–164
Fagg GE, Dongarra J (2000) FT-MPI: fault tolerant MPI, supporting dynamic applications in a dynamic world. Recent Adv Parallel Virtual Mach Message Passing Interface 346–353
Daly J (2006) A higher order estimate of the optimum checkpoint interval for restart dumps. Future Gener Comput Syst 22:303–312
Gomez LB, Nukada A, Maruyama N, Cappello F, Matsuoka S (2010) Low-overhead diskless checkpoint for hybrid computing systems. In: The 2010 international conference on high performance computing (HiPC), pp 1–10
Litzkow M, Tannenbaum T, Basney J, Livny M (1997) Checkpoint and migration of UNIX processes in the condor distributed processing system. In: University of Wisconsin-Madison computer science technical report, no. 1346
Cores I, Rodríguez G, Martín MJ, González P (2014) In-memory application-level checkpoint-based migration for MPI programs. J Supercomput. doi:10.1007/3-540-44864-0_1
Chakravorty S, Mendes C, Kale L (2006) Proactive fault tolerance in MPI applications via task migration. In: Proceedings of the international conference on high performance computing (HiPC’06), pp 485–496

Acknowledgments

The authors thank Tang from the Illinois Institute of Technology for making the log file of Intrepid available. We would like to thank the reviewers for their comments and suggestions, which greatly helped improve this paper.

Author information

Authors and Affiliations

School of Computer, Northwestern Polytechnical University, Xi’an, Shaanxi, People’s Republic of China
Lei Zhu, Jianhua Gu, Yunlan Wang, Tianhai Zhao & Zhennao Cai

Corresponding author

Correspondence to Lei Zhu.

About this article

Cite this article

Zhu, L., Gu, J., Wang, Y. et al. Optimizing the fault-tolerance overheads of HPC systems using prediction and multiple proactive actions. J Supercomput 71, 3668–3694 (2015). https://doi.org/10.1007/s11227-015-1458-0

Published: 04 June 2015
Issue Date: October 2015
DOI: https://doi.org/10.1007/s11227-015-1458-0
