DDG Task Recovery for Cluster Computing

Nguyen, G. T.; Hluchy, L.; Tran, V. D.; Kotocova, M.

doi:10.1007/3-540-48086-2_41

Conference paper
First Online: 01 January 2002

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 2328))

Abstract

This paper presents a solution to the problem of transparent recovery of asynchronous distributed computation on clusters of workstations when a fault occurs on a node. If the system has fault-tolerant features, it can survive the fault and continue its computations. Performance degradation is unavoidable when redundant hardware is not available. A long-running application benefits greatly if it can restart from a checkpoint instead of restarting the whole computation. This paper presents the fault-tolerant features of the DDG environment, which is oriented to cluster systems without spare hardware.
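To make the checkpoint idea in the abstract concrete, here is a minimal single-process sketch of checkpoint and restart. It is not the DDG recovery mechanism, which this page does not describe; the names `NodeFault`, `save_checkpoint`, `load_checkpoint`, and `run`, the checkpoint interval, and the toy workload (a running sum of squares) are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


class NodeFault(Exception):
    """Hypothetical stand-in for a node failure (illustration only)."""


def save_checkpoint(path, state):
    # Write to a temporary file first, then rename it over the old
    # checkpoint, so a fault during the write cannot leave a partial file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    # Return the last saved state, or a fresh initial state if none exists.
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"next_step": 0, "total": 0}


def run(path, n_steps, every=10, fail_at=None):
    """Run the toy task, resuming from the checkpoint at `path` if present.

    Returns (result, number of steps executed in this run).
    """
    state = load_checkpoint(path)
    executed = 0
    for step in range(state["next_step"], n_steps):
        if fail_at is not None and step == fail_at:
            # Simulated fault: work done since the last checkpoint is lost.
            raise NodeFault(f"simulated fault at step {step}")
        state["total"] += step * step
        state["next_step"] = step + 1
        executed += 1
        if state["next_step"] % every == 0:
            save_checkpoint(path, state)
    return state["total"], executed
```

If a run with `every=10` fails at step 25, the last checkpoint covers steps 0 to 19, so a second call to `run` with the same path repeats only steps 20 to 24 before continuing, rather than redoing the whole computation.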

This work is supported by the Slovak Scientific Grant Agency within Research Project No. 2/7186/20.

Author information

Authors and Affiliations

SAS, Institute of Informatics, Dubravska cesta 9, 84237, Bratislava, Slovakia
G. T. Nguyen, L. Hluchy & V. D. Tran
Department of Computer Science, STU, Ilkovicova 3, 81219, Bratislava, Slovakia
M. Kotocova

Editor information

Editors and Affiliations

Institute of Mathematics and Computer Science, Technical University of Czestochowa, Dabrowskiego 73, 42-200, Czestochowa, Poland
Roman Wyrzykowski
Computer Science Department, University of Tennessee, 122 Volunteer Blvd, Knoxville, TN, 37996-3450, USA
Jack Dongarra
Computer Science Department, Oklahoma State University, 700 N. Greenwood Ave., Tulsa, OK, 74106, USA
Marcin Paprzycki
DTU, UNI-C, Danish Computing Centre for Research and Education, Bldg. 304, 2800, Lyngby, Denmark
Jerzy Waśniewski

About this paper

Cite this paper

Nguyen, G.T., Hluchy, L., Tran, V.D., Kotocova, M. (2002). DDG Task Recovery for Cluster Computing. In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Waśniewski, J. (eds) Parallel Processing and Applied Mathematics. PPAM 2001. Lecture Notes in Computer Science, vol 2328. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48086-2_41

DOI: https://doi.org/10.1007/3-540-48086-2_41
Published: 06 June 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43792-5
Online ISBN: 978-3-540-48086-0
eBook Packages: Springer Book Archive
