Abstract
Fault tolerance is one of the crucial challenges that HPC systems must overcome to reach exascale. In this paper, we consider how predictions that fail to identify the fault-occurrence time precisely affect uncoordinated proactive checkpointing/restart (C/R). We extend Aupy’s model to cover uncoordinated proactive C/R and distorted predictions. We then propose optimal strategies for deciding when to accept a prediction, and provide algorithms for computing the optimal storage interval for periodic C/R. The results show that the proposed method can significantly improve system performance. Furthermore, our case study indicates that, for our system, the recall of the predictor is more important than its precision.
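To make the terms in the abstract concrete, the sketch below computes predictor precision and recall, together with the classical first-order (Young) periodic checkpoint interval. It is not the model proposed in the paper. The helper `unpredicted_mtbf` is a hypothetical simplification that assumes every correctly predicted fault is absorbed by a proactive checkpoint and ignores the cost of false alarms. All function and parameter names are illustrative.

```python
import math


def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) of a fault predictor from raw counts."""
    precision = true_pos / (true_pos + false_pos)  # share of predictions that were real faults
    recall = true_pos / (true_pos + false_neg)     # share of real faults that were predicted
    return precision, recall


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order approximation of the periodic checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def unpredicted_mtbf(mtbf, recall):
    """Mean time between faults that the predictor misses.

    Assumption (not from the paper): predicted faults are fully handled
    proactively, so only the (1 - recall) fraction hits the periodic scheme.
    """
    if recall >= 1.0:
        return math.inf
    return mtbf / (1.0 - recall)


if __name__ == "__main__":
    p, r = precision_recall(true_pos=8, false_pos=2, false_neg=2)
    base = young_interval(checkpoint_cost=60.0, mtbf=3000.0)
    with_predictor = young_interval(60.0, unpredicted_mtbf(3000.0, r))
    print(p, r, base, with_predictor)
```

Under these assumptions, a higher recall lengthens the interval between periodic checkpoints, which gives one intuition for why recall can matter more than precision.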
References
Robert, F.: What it’ll take to go exascale. Science 27, 394–396 (2012)
Sato, K., Moody, A., Mohror, K., Gamblin, T., et al.: Design and modeling of a non-blocking checkpointing system. In: The 2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2012), Article No. 19 (2012)
Cappello, F.: Fault tolerance in petascale/exascale systems: current knowledge, challenges and research opportunities. Int. J. High Perform. Comput. Appl. 23, 212–226 (2009)
Zheng, Z., Lan, Z., Gupta, R., Coghlan, S., Beckman, P.: A practical failure prediction with location and lead time for Blue Gene/P. In: Dependable Systems and Networks Workshops (DSN-W 2010), pp. 15–22 (2010)
Varela, M.R., Ferreira, K.B., Riesen, R.: Fault-tolerance for exascale systems. In: 2010 IEEE International Conference on Cluster Computing Workshops and Posters (CLUSTER WORKSHOPS), pp. 1–4 (2010)
Leonardo, B.G., Seiji, T., Komatitsch, D., Cappello, F., Maruyama, N., et al.: FTI: high performance fault tolerance interface for hybrid systems. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–32 (2011)
Moody, A., Bronevetsky, G., Mohror, K., Bronis, R., et al.: Design, modeling, and evaluation of a scalable multi-level checkpointing system. In: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2010), pp. 1–11 (2010)
Jangjaimon, I., Tzeng, N.F.: Adaptive incremental checkpointing via delta compression for networked multicore systems. In: The 27th IEEE International Symposium on Parallel & Distributed Processing (IPDPS 2013), pp. 7–18 (2013)
Gainaru, A., Cappello, F., Kramer, W.: Taming of the shrew: modeling the normal and faulty behavior of large-scale HPC systems. In: IEEE 26th International Parallel & Distributed Processing Symposium (IPDPS), pp. 1168–1179 (2012)
Gainaru, A., Cappello, F., Kramer, W., Snir, M.: Fault prediction under the microscope: a closer look into HPC systems. In: The 2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2012), Article No. 77 (2012)
Yu, L., Zheng, Z., Lan, Z., Coghlan, S.: Practical online failure prediction for Blue Gene/P: period-based vs event-driven. In: Dependable Systems and Networks Workshops (DSN-W), pp. 259–264
Mohamed, S.B., Gainaru, A., Leonardo, B.G., Franck, C., et al.: Improving the computing efficiency of HPC systems using a combination of proactive and preventive checkpointing. In: The 27th IEEE International Symposium on Parallel & Distributed Processing (IPDPS 2013), pp. 501–512 (2013)
Aupy, G., Robert, Y., Vivien, F., Zaidouni, D.: Checkpointing algorithms and fault prediction. J. Parallel Distrib. Comput. 74(2), 2048–2064 (2014)
Elnozahy, E.N.M., Alvisi, L., Wang, Y.M., Johnson, D.B.: A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv. 34, 375–408 (2002)
Bosilca, G., Bouteiller, A., Brunet, E., Cappello, F., et al.: Unified model for assessing checkpointing protocols at extreme-scale. Concurrency Computat 26, 2772–2791 (2014)
Ifeanyi, P., Egwutuoha, D.L., Bran, S., Shiping, C.: A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems. J. Supercomputing 65(3), 1302–1326 (2013)
Lan, Z., Gu, J.X., Zheng, Z.M., Thakur, R., et al.: Dynamic meta-learning for failure prediction in large-scale systems: a case study. In: Proceedings of International Conference on Parallel Processing, pp. 157–164 (2008)
Nakka, N., Agrawal, A., Choudhary, A.: Predicting node failure in high performance computing systems from failure and usage logs. In: IEEE Workshop on Dependable Parallel, Distributed and Network-Centric Systems, pp. 1557–1566 (2011)
Gainaru, A., Cappello, F., Snir, M., Kramer, W.: Fault prediction under the microscope: a closer look into HPC systems. In: Proceedings of the International Conference on High Performance Computing (SC 2012), Article No. 77 (2012)
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Zhu, L., Gu, J., Cai, Z. (2015). Optimizing the Overheads for Uncoordinated Proactive Checkpointing. In: Wang, G., Zomaya, A., Martinez, G., Li, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2015. Lecture Notes in Computer Science, vol 9531. Springer, Cham. https://doi.org/10.1007/978-3-319-27140-8_49
DOI: https://doi.org/10.1007/978-3-319-27140-8_49
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27139-2
Online ISBN: 978-3-319-27140-8
eBook Packages: Computer Science, Computer Science (R0)