Abstract
Fault tolerance strategies such as checkpointing are essential to maintain the availability of systems and their applications in high-performance computing environments. However, checkpoint storage can degrade the performance and scalability of parallel applications that use message passing. This work studies the factors that affect checkpoint storage and how they influence the scalability of a fault-tolerant application. A methodology has been designed that predicts the checkpoint size as the number of processes, the application workload, or the process mapping varies, using a reduced number of resources. By following this methodology, the system administrator can decide how many processes and how many nodes to use, and can adjust the process mapping accordingly, for applications that use checkpoints.
Acknowledgements
This publication was supported under contract TIN2017-84875-P, funded by the Agencia Estatal de Investigación (AEI), Spain, and the Fondo Europeo de Desarrollo Regional (FEDER) UE and partially funded by a research collaboration agreement with the Fundación Escuelas Universitarias Gimbernat (EUG).
Cite this article
León, B., Franco, D., Rexachs, D. et al. Analysis of parallel application checkpoint storage for system configuration. J Supercomput 77, 4582–4617 (2021). https://doi.org/10.1007/s11227-020-03445-1