Abstract
As future HPC systems grow larger, both failure rates and the cost of checkpointing to the global file system are expected to increase. Hierarchical incremental checkpoint/restart (CPR) is a promising approach to this problem. It uses a storage hierarchy of local and global storage and performs incremental checkpointing, writing only the memory pages updated between two consecutive checkpoints. In this paper, we address an open question: how to optimize the checkpoint interval when the checkpoint overhead changes over time, as it does in hierarchical incremental CPR. We propose a runtime checkpoint interval autotuning technique to optimize the efficiency of hierarchical incremental CPR. Evaluation results show that efficiency can be significantly increased if the storage hierarchy is exploited with appropriate checkpoint intervals.
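To illustrate why a time-varying checkpoint overhead calls for retuning the interval at runtime, the sketch below recomputes the interval after every checkpoint. It is not the technique proposed in the paper. It assumes Young's classic first-order approximation, interval = sqrt(2 × checkpoint cost × MTBF), and smooths the measured cost with an exponential moving average. The names `young_interval`, `IntervalTuner`, and `alpha` are hypothetical and chosen only for this example.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order approximation of the optimum checkpoint interval.

    checkpoint_cost and mtbf (mean time between failures) are in seconds.
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


class IntervalTuner:
    """Hypothetical runtime tuner: an illustrative sketch, not the paper's method.

    It keeps an exponential moving average of the measured checkpoint cost and
    recomputes the interval with Young's formula after each checkpoint.
    """

    def __init__(self, mtbf, alpha=0.5):
        self.mtbf = mtbf      # assumed to be known and constant
        self.alpha = alpha    # smoothing factor in (0, 1], an assumption
        self.cost = None      # smoothed checkpoint cost, unknown at start

    def observe(self, measured_cost):
        """Record the cost of the latest checkpoint and return the next interval."""
        if self.cost is None:
            self.cost = measured_cost
        else:
            self.cost = self.alpha * measured_cost + (1.0 - self.alpha) * self.cost
        return young_interval(self.cost, self.mtbf)


if __name__ == "__main__":
    tuner = IntervalTuner(mtbf=10000.0)
    # A costly full checkpoint followed by cheaper incremental ones.
    for cost in (50.0, 2.0, 2.0, 2.0):
        print(f"cost={cost:5.1f} s -> next interval={tuner.observe(cost):7.1f} s")
```

In this sketch the suggested interval shrinks as the incremental checkpoints become cheaper, which is the kind of behavior a fixed, precomputed interval cannot capture.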
Acknowledgments
This research is partially supported by JST CREST “An Evolutionary Approach to Construction of a Software Development Environment for Massively-Parallel Heterogeneous Systems” and Grant-in-Aid for Scientific Research (B) #25280041. The first author, Alfian Amrizal, is financially supported by Monbukagakusho.
Copyright information
© 2015 Springer International Publishing Switzerland
Cite this paper
Amrizal, A., Hirasawa, S., Takizawa, H., Kobayashi, H. (2015). Automatic Parameter Tuning of Hierarchical Incremental Checkpointing. In: Daydé, M., Marques, O., Nakajima, K. (eds) High Performance Computing for Computational Science -- VECPAR 2014. VECPAR 2014. Lecture Notes in Computer Science, vol 8969. Springer, Cham. https://doi.org/10.1007/978-3-319-17353-5_25
DOI: https://doi.org/10.1007/978-3-319-17353-5_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-17352-8
Online ISBN: 978-3-319-17353-5
eBook Packages: Computer Science (R0)