Log-Based Rollback Recovery without Checkpoints of Shared Memory in Software DSM

Park, Soyeon; Maeng, Seung Ryoul

doi:10.1007/s11227-006-1667-7

Published: February 2006

The Journal of Supercomputing, Volume 35, pages 141–154 (2006)

Abstract

A common approach to fault-tolerant software DSM is to take checkpoints with message logging. Our remote logging has low overhead because each node saves its coherence-related data into the memory of a remote node over a high-speed system area network. To make fault-tolerant DSM more lightweight, this paper focuses on eliminating shared memory checkpointing during failure-free execution. Each node independently checkpoints only its execution state and non-shared data. When a node fails, it regenerates its pages from the remote copies held by live nodes. To reconstruct pages efficiently, we also introduce an XOR-diffing technique: the diff logs, created by XOR operations during failure-free execution, can be applied to any version of a remote copy, either backward or forward, during recovery. Our scheme reduces the checkpointing overhead and also alleviates the imbalance in execution times among nodes caused by independent checkpointing.
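The bidirectional property claimed for XOR diffs follows from XOR being its own inverse. The sketch below illustrates only that general property; it is not the authors' implementation. The function names `xor_diff` and `apply_diff`, the tiny page size, and the page contents are assumptions made up for the example, and the paper's actual log format and recovery protocol are not shown in the abstract.

```python
# Illustrative sketch only: shows why one XOR diff can move a page copy
# either forward or backward. All names and data here are hypothetical.

def xor_diff(old: bytes, new: bytes) -> bytes:
    """Return the byte-wise XOR of two equal-sized page versions."""
    if len(old) != len(new):
        raise ValueError("page versions must have the same length")
    return bytes(a ^ b for a, b in zip(old, new))


def apply_diff(page: bytes, diff: bytes) -> bytes:
    """Apply an XOR diff to a page copy.

    Because x ^ d ^ d == x, the same call rolls a page forward
    (old -> new) or backward (new -> old).
    """
    return xor_diff(page, diff)


if __name__ == "__main__":
    # Three successive versions of a hypothetical 8-byte page.
    v0 = bytes([0, 0, 0, 0, 0, 0, 0, 0])
    v1 = bytes([1, 0, 7, 0, 0, 0, 0, 0])
    v2 = bytes([1, 2, 7, 0, 0, 9, 0, 0])

    # Diff logs that would be recorded during failure-free execution.
    d01 = xor_diff(v0, v1)
    d12 = xor_diff(v1, v2)

    # Forward: an older remote copy is brought up to date.
    assert apply_diff(apply_diff(v0, d01), d12) == v2

    # Backward: a newer remote copy is rolled back to an earlier version.
    assert apply_diff(apply_diff(v2, d12), d01) == v0
    print("forward and backward reconstruction both succeed")
```

A conventional diff that stores only the new values of modified bytes can move a page forward but not backward. Under the assumptions above, the XOR form carries enough information for both directions, so a recovering node could start from whichever version a live node happens to hold.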

Author information

Authors and Affiliations

Department of Electrical Engineering and Computer Science, Korea Advanced Institute of Science and Technology, 373-1 Guseong-dong, Yuseong-gu, Daejeon, Korea
Soyeon Park & Seung Ryoul Maeng

Corresponding author

Correspondence to Soyeon Park.

Additional information

This research is supported by KISTEP under the National Research Laboratory program.

About this article

Cite this article

Park, S., Maeng, S.R. Log-Based Rollback Recovery without Checkpoints of Shared Memory in Software DSM. J Supercomput 35, 141–154 (2006). https://doi.org/10.1007/s11227-006-1667-7

Issue Date: February 2006
DOI: https://doi.org/10.1007/s11227-006-1667-7
