Enhancing Fault-Tolerance of Large-Scale MPI Scientific Applications

  • Conference paper
Parallel Computing Technologies (PaCT 2007)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4671)

Abstract

The running times of large-scale computational science and engineering applications, executed in parallel on clusters or Grid platforms, usually exceed the mean time between failures (MTBF) of the platform. Hardware failures must therefore be tolerated so that a machine failure does not discard all the computation completed so far. Checkpointing and rollback recovery are well-established techniques for building fault-tolerant applications. Although the field has been researched extensively, few tools are available to help parallel programmers add fault-tolerance support to their applications. This work reports the experience of adding fault tolerance to two large MPI scientific applications: an air quality simulation model and a crack growth analysis. The fault-tolerant solution was implemented with the CPPC framework, a checkpointing and recovery tool. Detailed experimental results show the practical usefulness and low overhead of this checkpointing approach.
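As a rough illustration of the checkpointing and rollback recovery idea mentioned in the abstract, the sketch below periodically saves the state of a toy iterative computation and resumes from the last saved state after a failure. It is a single-process Python example written for this page, not the CPPC framework or its API: the function names, the pickle file format, and the `fail_at` failure hook are all hypothetical.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # The rename is atomic, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing periodically.

    fail_at is a hypothetical hook that simulates a machine failure
    just before the given step is executed.
    """
    # Rollback recovery: resume from the last checkpoint if one exists.
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

Calling `run` a second time with the same checkpoint path after a simulated failure repeats only the steps executed since the last checkpoint. A real MPI application would additionally have to make the states saved by different processes mutually consistent, which this sketch does not attempt.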


Editor information

Victor Malyshkin

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Rodríguez, G., González, P., Martín, M.J., Touriño, J. (2007). Enhancing Fault-Tolerance of Large-Scale MPI Scientific Applications. In: Malyshkin, V. (eds) Parallel Computing Technologies. PaCT 2007. Lecture Notes in Computer Science, vol 4671. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73940-1_15

  • DOI: https://doi.org/10.1007/978-3-540-73940-1_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-73939-5

  • Online ISBN: 978-3-540-73940-1

  • eBook Packages: Computer Science, Computer Science (R0)
