Automatic Parameter Tuning of Hierarchical Incremental Checkpointing

  • Conference paper
High Performance Computing for Computational Science -- VECPAR 2014 (VECPAR 2014)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8969)

Abstract

As future HPC systems become larger, failure rates and the cost of checkpointing to the global file system are expected to increase. Hierarchical incremental checkpoint/restart (CPR) is a promising approach to this problem. It uses a hierarchical storage system consisting of local and global storage, and it performs incremental checkpointing by writing only the memory pages updated between two consecutive checkpoints. In this paper, we address an open question: how to optimize the checkpoint interval when the checkpoint overheads change over time, as they do in hierarchical incremental CPR. We propose a runtime checkpoint interval autotuning technique to optimize the efficiency of hierarchical incremental CPR. Evaluation results show that the efficiency can be significantly increased if the storage hierarchy is exploited with appropriate checkpoint intervals.
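
To make the interval question concrete, the following is a minimal sketch and not the technique proposed in the paper. It assumes Young's classical first-order approximation, in which the checkpoint interval is the square root of twice the checkpoint cost times the mean time between failures (MTBF), and it re-evaluates that formula at runtime as the measured checkpoint cost changes. The names `young_interval` and `IntervalTuner`, the exponential smoothing of the measured cost, and the parameter `alpha` are illustrative assumptions.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order approximation: sqrt(2 * C * MTBF).

    Both arguments must use the same time unit (for example, seconds).
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


class IntervalTuner:
    """Hypothetical runtime tuner (an assumption, not the paper's algorithm).

    It smooths the measured checkpoint cost with an exponential moving
    average and recomputes the interval after every checkpoint.
    """

    def __init__(self, mtbf, initial_cost, alpha=0.5):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.mtbf = mtbf
        self.cost = initial_cost
        self.alpha = alpha

    def observe(self, measured_cost):
        """Fold in one measured checkpoint cost and return the new interval."""
        self.cost = self.alpha * measured_cost + (1.0 - self.alpha) * self.cost
        return young_interval(self.cost, self.mtbf)


if __name__ == "__main__":
    tuner = IntervalTuner(mtbf=10000.0, initial_cost=50.0)
    # An incremental checkpoint that writes few pages is cheaper than a
    # full one, so the suggested interval shrinks as the cost drops.
    for cost in (50.0, 2.0, 2.0):
        print(round(tuner.observe(cost), 1))
```

In this sketch a cheaper checkpoint leads to a shorter interval, which is the basic reason a fixed interval is a poor fit when checkpoint overheads vary over time.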

Acknowledgments

This research is partially supported by JST CREST “An Evolutionary Approach to Construction of a Software Development Environment for Massively-Parallel Heterogeneous Systems” and Grant-in-Aid for Scientific Research (B) #25280041. The first author, Alfian Amrizal, is financially supported by Monbukagakusho.

Author information

Corresponding author

Correspondence to Hiroyuki Takizawa.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Amrizal, A., Hirasawa, S., Takizawa, H., Kobayashi, H. (2015). Automatic Parameter Tuning of Hierarchical Incremental Checkpointing. In: Daydé, M., Marques, O., Nakajima, K. (eds) High Performance Computing for Computational Science -- VECPAR 2014. VECPAR 2014. Lecture Notes in Computer Science, vol 8969. Springer, Cham. https://doi.org/10.1007/978-3-319-17353-5_25

  • DOI: https://doi.org/10.1007/978-3-319-17353-5_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-17352-8

  • Online ISBN: 978-3-319-17353-5

  • eBook Packages: Computer Science, Computer Science (R0)
