Abstract
Large-scale parallel applications in computational science and engineering, executed on clusters or Grid platforms, usually run for longer than the mean time between failures (MTBF) of the underlying hardware. Hardware failures must therefore be tolerated so that a machine failure does not discard all the computation completed so far. Checkpointing and rollback recovery are well-established techniques for building fault-tolerant applications. Although this field has been researched extensively, few tools are available to help parallel programmers add fault tolerance support to their applications. This work reports on the experience of adding fault tolerance to two large MPI scientific applications: an air quality simulation model and a crack growth analysis. The fault-tolerant solution was implemented with a checkpointing and recovery tool, the CPPC framework. Detailed experimental results are presented to show the practical usefulness and low overhead of this checkpointing approach.
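The checkpointing and rollback recovery idea summarized in the abstract can be illustrated with a minimal, single-process sketch. This is not the CPPC framework or its API: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the pickle-based file format, and the simulated failure are all assumptions made for illustration, and a real MPI application would also need to coordinate checkpoints across processes.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it over the old
    checkpoint, so a crash mid-write cannot corrupt the previous one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, interval, fail_at=None):
    """Hypothetical iterative computation (a running sum) that resumes
    from the last checkpoint and saves its state every `interval` steps."""
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated hardware failure")
        state["acc"] += state["step"]  # stand-in for one unit of real work
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

In this sketch, a run that fails at step 7 with a checkpoint interval of 3 loses only the work done since step 6; calling `run` again with the same path rolls back to that checkpoint and completes the remaining steps.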
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Rodríguez, G., González, P., Martín, M.J., Touriño, J. (2007). Enhancing Fault-Tolerance of Large-Scale MPI Scientific Applications. In: Malyshkin, V. (eds) Parallel Computing Technologies. PaCT 2007. Lecture Notes in Computer Science, vol 4671. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73940-1_15
DOI: https://doi.org/10.1007/978-3-540-73940-1_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73939-5
Online ISBN: 978-3-540-73940-1
eBook Packages: Computer Science (R0)