
Analysis of parallel application checkpoint storage for system configuration

The Journal of Supercomputing

Abstract

Fault tolerance strategies such as checkpointing are essential for maintaining the availability of systems and their applications in high-performance computing environments. However, checkpoint storage can affect the performance and scalability of parallel applications that use message passing. This work studies the factors that affect checkpoint storage and how they influence the scalability of a fault-tolerant application. It presents a methodology for predicting the checkpoint size as the number of processes, the application workload, or the process mapping varies, using a reduced number of resources. Following this methodology, a system administrator can decide how many processes and how many nodes to use, and adjust the process mapping accordingly, for applications that use checkpoints.
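The idea of predicting checkpoint size from a few small runs can be illustrated with a minimal sketch. The abstract does not state the prediction model, so the following assumes, purely for illustration, that the total checkpoint size grows linearly with the number of processes (a fixed per-process overhead plus a workload-dependent term). The function names and the sample measurements are hypothetical and are not taken from the article.

```python
# Illustrative sketch only: the linear model and all numbers below are
# assumptions, not the model or data used in the article.

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_total_checkpoint_mb(slope, intercept, num_processes):
    """Extrapolate the total checkpoint size (MB) to a larger process count."""
    return slope * num_processes + intercept


def per_node_checkpoint_mb(total_mb, num_processes, processes_per_node):
    """Checkpoint volume (MB) written by one fully populated node,
    assuming every process contributes an equal share of the total."""
    return total_mb / num_processes * processes_per_node


if __name__ == "__main__":
    # Hypothetical measurements from small runs: process count -> total MB.
    small_runs = {2: 500.0, 4: 600.0, 8: 800.0, 16: 1200.0}
    slope, intercept = fit_line(list(small_runs), list(small_runs.values()))

    # Compare two candidate mappings for a larger run of 64 processes.
    total = predict_total_checkpoint_mb(slope, intercept, 64)
    for ppn in (8, 16):
        per_node = per_node_checkpoint_mb(total, 64, ppn)
        print(f"{ppn} processes/node: about {per_node:.0f} MB per node")
```

Under these assumptions, the per-node figure is what an administrator would compare against the storage available on each node when choosing between mappings. A real application may not scale linearly, so the fitted model should be validated against at least one larger measured run.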



Acknowledgements

This publication was supported under contract TIN2017-84875-P, funded by the Agencia Estatal de Investigación (AEI), Spain, and the Fondo Europeo de Desarrollo Regional (FEDER), EU, and was partially funded by a research collaboration agreement with the Fundación Escuelas Universitarias Gimbernat (EUG).

Author information


Corresponding author

Correspondence to Betzabeth León.



About this article


Cite this article

León, B., Franco, D., Rexachs, D. et al. Analysis of parallel application checkpoint storage for system configuration. J Supercomput 77, 4582–4617 (2021). https://doi.org/10.1007/s11227-020-03445-1


  • DOI: https://doi.org/10.1007/s11227-020-03445-1
