Self-refined Fault Tolerance in HPC Using Dynamic Dependent Process Groups

Gopalan, N. P.; Nagarajan, K.

doi:10.1007/11603771_18

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3741)

Included in the following conference series:

International Workshop on Distributed Computing

Abstract

This paper proposes a novel method for achieving distributed, self-refined fault tolerance by dynamically partitioning the processes into smaller groups that are mutually disjoint and collectively exhaustive of the whole system. The model tolerates frequent faults and makes rollback recovery simpler and less time-consuming. An optimal checkpoint interval is found using a mathematical approximation, and a spare process captures all in-transit messages when a process fails at its ends. Process grouping relies on piggybacking the events of dependent processes on outgoing messages. The process with the maximum information can scatter chunk values to the other dependent processes in its group, and each process constructs a checkpoint when the received chunk matches its log.
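
The grouping idea in the abstract can be illustrated with a small sketch. The abstract does not give the paper's actual algorithm, so everything here is an assumption made for illustration: the function name `group_processes`, the representation of a message as a `(sender, receiver)` pair, and the choice to treat dependencies as symmetric when forming groups. Each message carries the sender's known dependency set, the receiver merges it into its own, and processes are then partitioned into groups that are mutually disjoint and together cover every process.

```python
def group_processes(num_procs, messages):
    """Partition processes 0..num_procs-1 into dependent groups.

    messages is an ordered list of (sender, receiver) pairs (hypothetical
    trace format). Returns a sorted list of disjoint groups covering all
    processes.
    """
    # Each process starts out depending only on itself.
    deps = [{p} for p in range(num_procs)]

    # Piggybacking: every message carries the sender's dependency set,
    # and the receiver merges it into its own on delivery.
    for sender, receiver in messages:
        deps[receiver] |= deps[sender]

    # Union-find over processes; a process joins the group of everything
    # it depends on (assumption: dependency is treated as symmetric here).
    parent = list(range(num_procs))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for p in range(num_procs):
        for q in deps[p]:
            parent[find(q)] = find(p)

    # Collect members by group representative.
    groups = {}
    for p in range(num_procs):
        groups.setdefault(find(p), []).append(p)
    return sorted(groups.values())


if __name__ == "__main__":
    # Processes 0 and 2 both send to 1; 3 sends to 4; 5 is idle.
    print(group_processes(6, [(0, 1), (2, 1), (3, 4)]))
```

Run as a script, this prints `[[0, 1, 2], [3, 4], [5]]`: three groups with no process shared between them and none left out. Under this scheme a rollback after a failure would only need to involve the failed process's own group, which is the kind of saving the abstract attributes to smaller groups.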

Author information

Authors and Affiliations

Department of Computer Science and Engineering, National Institute of Technology, Tiruchirappalli, TN, 620015, India
N. P. Gopalan & K. Nagarajan

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, West Bengal, India
Ajit Pal
Department of Computer Science, University of Illinois at Chicago, 60607, Chicago, Illinois
Ajay D. Kshemkalyani
Computer Science and Engineering Department, Indian Institute of Technology, 721 302, Kharagpur, WB, India
Rajeev Kumar
Department of Computer Science and Engineering, Indian Institute of Technology, 721 302, Kharagpur, India
Arobinda Gupta


About this paper

Cite this paper

Gopalan, N.P., Nagarajan, K. (2005). Self-refined Fault Tolerance in HPC Using Dynamic Dependent Process Groups. In: Pal, A., Kshemkalyani, A.D., Kumar, R., Gupta, A. (eds) Distributed Computing – IWDC 2005. IWDC 2005. Lecture Notes in Computer Science, vol 3741. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11603771_18

DOI: https://doi.org/10.1007/11603771_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30959-8
Online ISBN: 978-3-540-32428-7
eBook Packages: Computer Science, Computer Science (R0)
