Optimizing checkpoint for scientific simulations

Journal of Zhejiang University SCIENCE C

Abstract

Restarting a long-running simulation from the beginning after a failure is extremely time-consuming. Checkpointing is a viable solution: it allows a simulation to be resumed from the most recent saved state instead of from the start. We study three models for determining the optimal interval between consecutive checkpoints so that the total execution time is minimized, and we demonstrate that optimal checkpointing can facilitate self-optimization. These results inform both the understanding and the practice of optimizing long-running scientific simulations.
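The abstract does not state which three models the paper analyzes, so the following is only an illustrative sketch of the underlying trade-off, not the authors' method. It uses the classical first-order approximation of Young (Commun. ACM, 1974), in which the fraction of time lost is roughly C/T + T/(2M) for checkpoint cost C, checkpoint interval T, and mean time between failures M, giving an optimum near T = sqrt(2CM). The function names and the sample values are assumptions chosen for illustration.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """First-order optimum checkpoint interval, sqrt(2 * C * M).

    checkpoint_cost (C) and mtbf (M) must be in the same time unit.
    This is Young's approximation, assumed here for illustration only.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def overhead_fraction(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """Approximate fraction of run time lost for a given interval.

    First term: time spent writing one checkpoint per interval.
    Second term: expected rework after a failure, about half an interval
    per failure, with one failure every mtbf time units on average.
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    # Hypothetical values: a 2-minute checkpoint and a 100-minute MTBF.
    cost, mtbf = 2.0, 100.0
    best = young_interval(cost, mtbf)
    print(f"approximate optimum interval: {best:.1f} minutes")
    for t in (best / 2, best, best * 2):
        print(f"interval {t:5.1f} -> overhead {overhead_fraction(t, cost, mtbf):.3f}")
```

Checkpointing too often wastes time writing state, and checkpointing too rarely wastes time recomputing lost work; the approximation is only reasonable when the checkpoint cost is small compared with the mean time between failures.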

Author information

Corresponding author

Correspondence to Ying-ping Huang.

Additional information

Project supported by the National Science Foundation of the USA and the Information Technology Research program (ITR/AP-DEB) (No. 0112820).

About this article

Cite this article

Xiao, Xs., Huang, Yp. & Zhang, Xh. Optimizing checkpoint for scientific simulations. J. Zhejiang Univ. - Sci. C 13, 891–900 (2012). https://doi.org/10.1631/jzus.C1200135

  • DOI: https://doi.org/10.1631/jzus.C1200135
