Abstract
This paper presents a solution to the problem of transparently recovering asynchronous distributed computations on clusters of workstations when a fault occurs on a node. A system with fault-tolerant features can survive the fault and continue its computations. Performance degradation is unavoidable when hardware redundancy is not available. A long-running application benefits considerably if it can restart from a checkpoint instead of restarting the whole computation. This paper presents the fault-tolerant feature of the DDG environment, which is oriented to cluster systems without hardware spares.
This work is supported by the Slovak Scientific Grant Agency within Research Project No. 2/7186/20.
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
Cite this paper
Nguyen, G.T., Hluchy, L., Tran, V.D., Kotocova, M. (2002). DDG Task Recovery for Cluster Computing. In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Waśniewski, J. (eds) Parallel Processing and Applied Mathematics. PPAM 2001. Lecture Notes in Computer Science, vol 2328. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48086-2_41
DOI: https://doi.org/10.1007/3-540-48086-2_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43792-5
Online ISBN: 978-3-540-48086-0
eBook Packages: Springer Book Archive