DDG Task Recovery for Cluster Computing

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2328)

Abstract

This paper presents a solution to the problem of transparent recovery of asynchronous distributed computations on clusters of workstations when a fault occurs on a node. If the system has fault-tolerant features, it can survive the fault and continue its computations. Performance degradation is unavoidable when hardware redundancy is not available. It is a significant advantage if a long-running application can restart from a checkpoint instead of restarting the whole computation. This paper presents the fault-tolerant features of the DDG environment, which is oriented to cluster systems without spare hardware.

This work is supported by the Slovak Scientific Grant Agency within Research Project No. 2/7186/20.


Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Nguyen, G.T., Hluchy, L., Tran, V.D., Kotocova, M. (2002). DDG Task Recovery for Cluster Computing. In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Waśniewski, J. (eds) Parallel Processing and Applied Mathematics. PPAM 2001. Lecture Notes in Computer Science, vol 2328. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48086-2_41

  • DOI: https://doi.org/10.1007/3-540-48086-2_41

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-43792-5

  • Online ISBN: 978-3-540-48086-0

  • eBook Packages: Springer Book Archive
