Fault-Tolerant Parallel Applications with Dynamic Parallel Schedules: A Programmer’s Perspective

Dependable Systems: Software, Computing, Networks

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 4028)

Abstract

Dynamic Parallel Schedules (DPS) is a flow-graph-based framework for developing parallel applications on clusters of workstations. The DPS flow graph execution model enables automatic pipelined parallel execution of applications. DPS supports graceful degradation of parallel applications in case of node failures. The fault-tolerance mechanism relies on a set of backup threads stored in the volatile storage of alternate nodes. These backup threads are kept up to date both by duplicating transmitted data objects and by periodic checkpointing. The current state of a failed node can be reconstructed on its backup threads by re-executing the application from the last checkpoint. A valid execution order is automatically deduced from the flow graph. Adding fault tolerance to a DPS application requires only minor changes to the application’s source code. The present contribution focuses on the development of fault-tolerant parallel applications with DPS from a programmer’s perspective.
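
The recovery idea described in the abstract (duplicate each transmitted data object to a backup, checkpoint periodically, then rebuild the failed state by re-executing from the last checkpoint) can be illustrated with a minimal single-process sketch. This is not the DPS API: the class names `BackupThread` and `ActiveThread`, the `accumulate` operation, and the checkpoint interval are all hypothetical, and the sketch assumes a deterministic operation and an in-order log rather than an execution order deduced from a flow graph.

```python
import copy


class BackupThread:
    """Hypothetical backup: last checkpoint plus data objects received since."""

    def __init__(self, initial_state):
        self.checkpoint = copy.deepcopy(initial_state)
        self.log = []

    def duplicate(self, data_object):
        # Every data object sent to the active thread is also stored here.
        self.log.append(data_object)

    def store_checkpoint(self, state):
        # A new checkpoint makes the earlier log entries unnecessary.
        self.checkpoint = copy.deepcopy(state)
        self.log.clear()

    def recover(self, operation):
        # Re-execute the logged data objects on top of the last checkpoint.
        state = copy.deepcopy(self.checkpoint)
        for data_object in self.log:
            state = operation(state, data_object)
        return state


class ActiveThread:
    """Hypothetical active thread that applies an operation to each input."""

    def __init__(self, initial_state, operation, backup, checkpoint_interval):
        self.state = initial_state
        self.operation = operation
        self.backup = backup
        self.checkpoint_interval = checkpoint_interval
        self.processed = 0

    def receive(self, data_object):
        self.backup.duplicate(data_object)
        self.state = self.operation(self.state, data_object)
        self.processed += 1
        if self.processed % self.checkpoint_interval == 0:
            self.backup.store_checkpoint(self.state)


def accumulate(state, value):
    # Deterministic operation (assumed): a running sum.
    return state + value


backup = BackupThread(initial_state=0)
active = ActiveThread(0, accumulate, backup, checkpoint_interval=3)
for value in [1, 2, 3, 4, 5]:
    active.receive(value)

# Simulate a failure of the active thread: only the backup is consulted.
recovered = backup.recover(accumulate)
print(recovered)
```

In this sketch the checkpoint taken after the third data object lets the backup discard the first three log entries, so recovery replays only the objects received afterwards. The trade-off being illustrated is that more frequent checkpoints shorten re-execution at the cost of more state transfers.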

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Gerlach, S., Schaeli, B., Hersch, R.D. (2006). Fault-Tolerant Parallel Applications with Dynamic Parallel Schedules: A Programmer’s Perspective. In: Kohlas, J., Meyer, B., Schiper, A. (eds) Dependable Systems: Software, Computing, Networks. Lecture Notes in Computer Science, vol 4028. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11808107_9

  • DOI: https://doi.org/10.1007/11808107_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-36821-2

  • Online ISBN: 978-3-540-36823-6

  • eBook Packages: Computer Science, Computer Science (R0)
