Optimizing checkpoint for scientific simulations

Xiao, Xi-sheng; Huang, Ying-ping; Zhang, Xi-hui

doi:10.1631/jzus.C1200135

Published: 09 December 2012

Volume 13, pages 891–900, (2012)

Journal of Zhejiang University SCIENCE C

Xi-sheng Xiao^1,2, Ying-ping Huang^3 & Xi-hui Zhang^3

Abstract

It is extremely time-consuming to restart a long-running simulation from the beginning when a failure occurs. Checkpointing is a viable solution that enables simulations to be resumed from the point of failure. We study three models to determine the optimal interval between contiguous checkpoints so that the total execution time is minimized, and we demonstrate that optimal checkpointing can facilitate self-optimization. This study advances our knowledge of and practice in optimizing long-running scientific simulations.
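The three models studied in the article are not reproduced in this preview. As a rough illustration of what "optimal checkpoint interval" means, the following minimal sketch uses the classic first-order approximation due to Young (1974), T = sqrt(2 * C * M), where C is the time to write one checkpoint and M is the mean time between failures. It is not the authors' method; the function names and the sample numbers are assumptions chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval.

    checkpoint_cost: time to write one checkpoint (C).
    mtbf: mean time between failures (M), in the same time unit.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def overhead_fraction(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """First-order estimate of time lost per unit of useful work.

    C / T is the cost of writing checkpoints; T / (2 * M) is the expected
    rework after a failure, assuming failures land mid-interval on average.
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    # Hypothetical values: 60 s per checkpoint, one failure per day on average.
    cost, mtbf = 60.0, 24 * 3600.0
    t_opt = young_interval(cost, mtbf)
    print(f"interval: {t_opt:.0f} s")
    print(f"overhead: {overhead_fraction(t_opt, cost, mtbf):.2%}")
```

The approximation is generally regarded as reasonable only when the checkpoint cost is small relative to the mean time between failures; checkpointing more or less often than the computed interval raises the first-order overhead estimate.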

Author information

Authors and Affiliations

Economics & Management College, Southwest Jiaotong University, Chengdu, 610031, China
Xi-sheng Xiao
Industrial and Commercial College, Guizhou University of Finance and Economics, Guiyang, 550003, China
Xi-sheng Xiao
College of Business, University of North Alabama, Florence, AL, 35632, USA
Ying-ping Huang & Xi-hui Zhang

Corresponding author

Correspondence to Ying-ping Huang.

Additional information

Project supported by the National Science Foundation of USA and the Information Technology Research (ITR/AP-DEB) (No. 0112820)

About this article

Cite this article

Xiao, Xs., Huang, Yp. & Zhang, Xh. Optimizing checkpoint for scientific simulations. J. Zhejiang Univ. - Sci. C 13, 891–900 (2012). https://doi.org/10.1631/jzus.C1200135

Received: 12 May 2012
Accepted: 03 September 2012
Published: 09 December 2012
Issue Date: December 2012
DOI: https://doi.org/10.1631/jzus.C1200135

Key words

CLC number

O242
