Analysis of parallel application checkpoint storage for system configuration

León, Betzabeth; Franco, Daniel; Rexachs, Dolores; Luque, Emilio

doi:10.1007/s11227-020-03445-1

Published: 16 October 2020

Volume 77, pages 4582–4617 (2021)

The Journal of Supercomputing

Abstract

Fault tolerance strategies such as checkpointing are essential to maintain the availability of systems and their applications in high-performance computing environments. However, checkpoint storage can affect the performance and scalability of parallel applications that use message passing. This work studies the elements that affect checkpoint storage and how they influence the scalability of a fault-tolerant application. A methodology has been designed to predict the checkpoint size when the number of processes, the application workload, or the mapping varies, using a reduced number of resources. By following this methodology, the system administrator can decide how many processes and nodes to use, adjusting the process mapping in applications that use checkpoints.

Acknowledgements

This publication was supported under contract TIN2017-84875-P, funded by the Agencia Estatal de Investigación (AEI), Spain, and the Fondo Europeo de Desarrollo Regional (FEDER) UE and partially funded by a research collaboration agreement with the Fundación Escuelas Universitarias Gimbernat (EUG).

Author information

Authors and Affiliations

Computer Architecture and Operating Systems Department, Universitat Autònoma de Barcelona, 08193, Bellaterra, Barcelona, Spain
Betzabeth León, Daniel Franco, Dolores Rexachs & Emilio Luque

Corresponding author

Correspondence to Betzabeth León.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

León, B., Franco, D., Rexachs, D. et al. Analysis of parallel application checkpoint storage for system configuration. J Supercomput 77, 4582–4617 (2021). https://doi.org/10.1007/s11227-020-03445-1

Accepted: 30 September 2020
Published: 16 October 2020
Issue Date: May 2021
DOI: https://doi.org/10.1007/s11227-020-03445-1
