
Log-Based Rollback Recovery without Checkpoints of Shared Memory in Software DSM

The Journal of Supercomputing

Abstract

A common approach to fault tolerance in software distributed shared memory (DSM) is to combine checkpointing with message logging. Our remote logging scheme has low overhead because each node saves coherence-related data into the memory of a remote node over a high-speed system area network. To make fault-tolerant DSM more lightweight, this paper focuses on eliminating shared-memory checkpointing during failure-free execution. Each node independently checkpoints only its execution state and non-shared data. When a node fails, it regenerates its pages from the remote copies held by live nodes. To reconstruct pages efficiently, we also introduce an XOR-diffing technique. The diff logs, which are created by XOR operations during failure-free execution, can be applied to any version of a remote copy, either backward or forward, during recovery. Our scheme reduces checkpointing overhead and also alleviates the imbalance in execution times among nodes caused by independent checkpointing.
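The property that makes an XOR diff usable in both directions can be illustrated with a minimal sketch. The abstract does not describe the actual log format or page handling, so the function names `xor_diff` and `apply_diff`, the 4-byte "pages", and the version names below are all hypothetical; the sketch only shows that XOR is its own inverse, so one diff can move a page copy forward or backward by one version.

```python
def xor_diff(old: bytes, new: bytes) -> bytes:
    """Return the byte-wise XOR of two equal-sized page versions (hypothetical helper)."""
    if len(old) != len(new):
        raise ValueError("page versions must have the same length")
    return bytes(a ^ b for a, b in zip(old, new))


def apply_diff(page: bytes, diff: bytes) -> bytes:
    """Apply an XOR diff to a page copy.

    Because (old ^ new) ^ old == new and (old ^ new) ^ new == old,
    the same operation rolls a page forward or backward.
    """
    return xor_diff(page, diff)


# Three successive versions of a tiny illustrative "page".
v0 = bytes([0, 0, 0, 0])
v1 = bytes([1, 0, 2, 0])
v2 = bytes([1, 5, 2, 7])

# Diffs as they might be logged during failure-free execution (assumed).
d01 = xor_diff(v0, v1)
d12 = xor_diff(v1, v2)

# Forward: an old remote copy (v0) is brought up to v2.
forward = apply_diff(apply_diff(v0, d01), d12)

# Backward: a newer remote copy (v2) is rolled back to v1.
backward = apply_diff(v2, d12)

print(forward == v2, backward == v1)
```

In this sketch a recovering node could start from whichever version of a page a live node happens to hold and apply the logged diffs in the appropriate direction, which is the behaviour the abstract attributes to the diff logs.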



Author information

Corresponding author

Correspondence to Soyeon Park.

Additional information

This research is supported by KISTEP under the National Research Laboratory program.


About this article

Cite this article

Park, S., Maeng, S.R. Log-Based Rollback Recovery without Checkpoints of Shared Memory in Software DSM. J Supercomput 35, 141–154 (2006). https://doi.org/10.1007/s11227-006-1667-7


  • DOI: https://doi.org/10.1007/s11227-006-1667-7
