Abstract
This paper proposes a novel method for achieving distributed, self-refined fault tolerance by dynamically partitioning the processes into smaller groups that are mutually disjoint and collectively exhaustive of the whole system. The model tolerates frequent faults and makes rollback recovery simple and less time-consuming. An optimal checkpoint interval is found using a mathematical approximation, and a spare process is used to capture all in-transit messages when a process fails. Process grouping is done by piggybacking the events of dependent processes on outgoing messages. The process with the most information can scatter chunk values to the other dependent processes in its group, and each process constructs a checkpoint when the received chunk matches its log.
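The grouping idea can be illustrated with a minimal sketch. It assumes that two processes become dependent as soon as one sends a message to the other, and it uses a union-find structure to keep the groups mutually disjoint and collectively exhaustive. The class `DependencyGroups` and the method `record_message` are hypothetical names chosen for illustration; the abstract does not specify the paper's actual data structures, its piggybacking format, or how groups are later split or refined.

```python
from collections import defaultdict


class DependencyGroups:
    """Hypothetical tracker: processes that exchange messages share a group."""

    def __init__(self, process_ids):
        # Every process starts alone, so the groups cover the whole system.
        self.parent = {p: p for p in process_ids}

    def find(self, p):
        # Follow parent links to the group representative (path halving).
        while self.parent[p] != p:
            self.parent[p] = self.parent[self.parent[p]]
            p = self.parent[p]
        return p

    def record_message(self, sender, receiver):
        # Assumption: a message makes sender and receiver mutually dependent,
        # so their groups are merged into one.
        a, b = self.find(sender), self.find(receiver)
        if a != b:
            self.parent[max(a, b)] = min(a, b)

    def groups(self):
        # Return the current partition as a sorted list of sorted groups.
        out = defaultdict(set)
        for p in self.parent:
            out[self.find(p)].add(p)
        return sorted(sorted(g) for g in out.values())


tracker = DependencyGroups(range(6))
for sender, receiver in [(0, 1), (1, 2), (3, 4)]:
    tracker.record_message(sender, receiver)
result = tracker.groups()
print(result)
```

In this sketch groups only ever merge. Any dynamic regrouping, for example resetting dependencies after a checkpoint, is outside what the abstract describes and is not modelled here.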
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Gopalan, N.P., Nagarajan, K. (2005). Self-refined Fault Tolerance in HPC Using Dynamic Dependent Process Groups. In: Pal, A., Kshemkalyani, A.D., Kumar, R., Gupta, A. (eds) Distributed Computing – IWDC 2005. IWDC 2005. Lecture Notes in Computer Science, vol 3741. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11603771_18
DOI: https://doi.org/10.1007/11603771_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30959-8
Online ISBN: 978-3-540-32428-7
eBook Packages: Computer Science, Computer Science (R0)