Integrating Coordinated Checkpointing and Recovery Mechanisms into DSM Synchronization Barriers

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 3503)

Abstract

Distributed Shared Memory (DSM) provides the abstraction of a physically shared memory that parallel programmers can access. Most recent software DSM systems provide relaxed memory models that guarantee consistency only at synchronization operations. Because the main goal of DSM systems is to support long-running, computation-intensive applications, checkpointing and recovery mechanisms are highly desirable. This article presents and evaluates the integration of a coordinated checkpointing mechanism into the barrier primitive that many DSM systems provide. Our results on several popular benchmarks and a real parallel application show that the overhead introduced during failure-free execution is often small.
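
The abstract does not spell out the protocol, so the following is only an illustrative sketch of the general idea of attaching a coordinated checkpoint to a barrier, not the authors' implementation. It assumes that threads stand in for DSM nodes, that checkpoints are kept in memory rather than on stable storage, and that a checkpoint is taken every `interval` barrier crossings. The names `CheckpointingBarrier`, `wait`, `restore`, and `run` are hypothetical.

```python
import copy
import threading


class CheckpointingBarrier:
    """Hypothetical barrier that takes a coordinated checkpoint
    every `interval` crossings (in memory, for illustration only)."""

    def __init__(self, parties, interval=1):
        self._interval = interval
        self._crossings = 0
        self._pending = {}        # snapshots for the checkpoint in progress
        self.committed = {}       # last complete checkpoint: worker id -> state
        self.committed_epoch = 0  # crossing number of that checkpoint
        self._lock = threading.Lock()
        # Phase 1: all workers arrive; one of them counts the crossing.
        self._arrive = threading.Barrier(parties, action=self._count)
        # Phase 2: all workers have saved; one of them commits the set.
        self._saved = threading.Barrier(parties, action=self._commit)

    def _count(self):
        self._crossings += 1

    def _commit(self):
        # Every worker has contributed, so the set is complete and consistent.
        self.committed = self._pending
        self._pending = {}
        self.committed_epoch = self._crossings

    def wait(self, worker_id, state):
        self._arrive.wait()
        if self._crossings % self._interval == 0:
            snapshot = copy.deepcopy(state)
            with self._lock:
                self._pending[worker_id] = snapshot
            # No worker resumes until all snapshots are stored.
            self._saved.wait()

    def restore(self, worker_id):
        """Return a copy of this worker's state from the last checkpoint."""
        return copy.deepcopy(self.committed[worker_id])


def run(num_workers=3, steps=4, interval=2):
    """Run a toy computation in which each worker synchronizes once per step."""
    barrier = CheckpointingBarrier(num_workers, interval)

    def worker(wid):
        state = {"step": 0, "sum": 0}
        for step in range(1, steps + 1):
            state["sum"] += wid * step
            state["step"] = step
            barrier.wait(wid, state)

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return barrier
```

In this sketch the barrier already guarantees that every worker has stopped at the same point, so the snapshots taken there form a consistent global state without extra coordination messages; the second phase only ensures the set is complete before anyone resumes. After a failure, each worker would call `restore` and continue from the crossing recorded in `committed_epoch`.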

This work was partially supported by NSERC, Canada Foundation for Innovation and Canada Research Chair Programs.

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Boukerche, A., Koch, J., de Melo, A.C.M.A. (2005). Integrating Coordinated Checkpointing and Recovery Mechanisms into DSM Synchronization Barriers. In: Nikoletseas, S.E. (eds) Experimental and Efficient Algorithms. WEA 2005. Lecture Notes in Computer Science, vol 3503. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11427186_35

  • DOI: https://doi.org/10.1007/11427186_35

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25920-6

  • Online ISBN: 978-3-540-32078-4

  • eBook Packages: Computer Science, Computer Science (R0)
