Abstract
Process migration provides many benefits for parallel environments, including dynamic load balancing, data access locality, and fault tolerance. This paper describes an in-memory, application-level, checkpoint-based migration solution for MPI codes that uses the Hierarchical Data Format 5 (HDF5) to write the checkpoint files. The main features of the proposed solution are transparency for the user, achieved through the use of CPPC (ComPiler for Portable Checkpointing); portability, since the application-level approach makes the solution suitable for any MPI implementation and operating system, and the HDF5 file format enables restart on different architectures; and high performance, obtained by saving the checkpoint files to memory instead of to disk through HDF5 in-memory files. Experimental results show that the in-memory approach significantly reduces the I/O cost of the migration process.
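The in-memory idea can be illustrated with a minimal sketch. It is not the CPPC implementation and does not use HDF5: a plain `io.BytesIO` buffer with a fixed little-endian layout stands in for the HDF5 in-memory file, and the function names `save_checkpoint` and `load_checkpoint`, as well as the state layout, are hypothetical. The point is only that the checkpoint is built as a byte image in memory, in an architecture-independent encoding, so it could be handed to the destination process without touching the disk.

```python
import io
import struct


def save_checkpoint(state):
    """Serialize named lists of floats into an in-memory byte image.

    Assumption: a fixed little-endian layout stands in for the
    portable HDF5 format; nothing is written to disk.
    """
    buf = io.BytesIO()
    buf.write(struct.pack("<I", len(state)))           # number of variables
    for name, values in state.items():
        encoded = name.encode("utf-8")
        buf.write(struct.pack("<I", len(encoded)))     # length of the name
        buf.write(encoded)
        buf.write(struct.pack("<I", len(values)))      # number of elements
        buf.write(struct.pack("<%dd" % len(values), *values))
    return buf.getvalue()


def load_checkpoint(image):
    """Rebuild the state dictionary from a byte image (destination side)."""
    buf = io.BytesIO(image)
    (count,) = struct.unpack("<I", buf.read(4))
    state = {}
    for _ in range(count):
        (name_len,) = struct.unpack("<I", buf.read(4))
        name = buf.read(name_len).decode("utf-8")
        (n,) = struct.unpack("<I", buf.read(4))
        state[name] = list(struct.unpack("<%dd" % n, buf.read(8 * n)))
    return state


if __name__ == "__main__":
    # Hypothetical application state to be migrated.
    state = {"iteration": [42.0], "grid": [0.5, 1.5, 2.5]}
    image = save_checkpoint(state)    # byte image held in memory
    restored = load_checkpoint(image)
    print(restored == state)
```

In a real migration the byte image would be sent to the destination process (for example as an MPI message) rather than decoded locally, and HDF5 would take care of the portable encoding.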









Acknowledgments
This research was supported by the Ministry of Science and Innovation of Spain (Project TIN2010-16735) and by the Galician Government (Project 10PXIB 105180PR and consolidation program of competitive reference groups GRC2013/055).
Cite this article
Cores, I., Rodríguez, G., Martín, M.J. et al. In-memory application-level checkpoint-based migration for MPI programs. J Supercomput 70, 660–670 (2014). https://doi.org/10.1007/s11227-014-1120-2
DOI: https://doi.org/10.1007/s11227-014-1120-2