Enhancing Fault-Tolerance of Large-Scale MPI Scientific Applications

Rodríguez, G.; González, P.; Martín, M. J.; Touriño, J.

doi:10.1007/978-3-540-73940-1_15

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4671)

Included in the following conference series:

International Conference on Parallel Computing Technologies

Abstract

Large-scale computational science and engineering parallel applications, executed on clusters or Grid platforms, usually run for longer than the mean time between failures (MTBF). Hardware failures must therefore be tolerated so that a machine failure does not discard all the computation completed so far. Checkpointing and rollback recovery are useful techniques for implementing fault-tolerant applications. Although extensive research has been carried out in this field, few tools are available to help parallel programmers add fault tolerance support to their applications. This work reports the experience of adding fault tolerance to two large MPI scientific applications: an air quality simulation model and a crack growth analysis. The fault-tolerant solution is implemented with a checkpointing and recovery tool, the CPPC framework. Detailed experimental results show the practical usefulness and low overhead of this checkpointing approach.

Author information

Authors and Affiliations

Computer Architecture Group, Dep. Electronics and Systems, University of A Coruña, Spain
G. Rodríguez, P. González, M. J. Martín & J. Touriño

Editor information

Victor Malyshkin

Cite this paper

Rodríguez, G., González, P., Martín, M.J., Touriño, J. (2007). Enhancing Fault-Tolerance of Large-Scale MPI Scientific Applications. In: Malyshkin, V. (eds) Parallel Computing Technologies. PaCT 2007. Lecture Notes in Computer Science, vol 4671. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73940-1_15

DOI: https://doi.org/10.1007/978-3-540-73940-1_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73939-5
Online ISBN: 978-3-540-73940-1
eBook Packages: Computer Science, Computer Science (R0)