DOI: 10.1145/2000064.2000083
research-article

Rebound: scalable checkpointing for coherent shared memory

Published: 04 June 2011

Abstract

As we move to large manycores, the hardware-based global checkpointing schemes that have been proposed for small shared-memory machines do not scale. Scalability barriers include global operations, work lost to global rollback, and inefficiencies in imbalanced or I/O-intensive loads. Scalable checkpointing requires tracking inter-thread dependences and building the checkpoint and rollback operations around dynamic groups of communicating processors.
To address this problem, this paper introduces Rebound, the first hardware-based scheme for coordinated local checkpointing in multiprocessors with directory-based cache coherence. Rebound leverages the transactions of a directory protocol to track inter-thread dependences. In addition, it boosts checkpointing efficiency by: (i) delaying the writeback of data to safe memory at checkpoints, (ii) supporting operation with multiple checkpoints, and (iii) optimizing checkpointing at barrier synchronization. Finally, Rebound introduces distributed algorithms for checkpointing and rolling back sets of processors. Simulations of parallel programs with up to 64 threads show that Rebound is scalable and has very low overhead. For 64 processors, its average performance overhead is only 2%, compared to 15% for global checkpointing.
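
The idea of building checkpoints and rollbacks around dynamic groups of communicating processors can be illustrated with a small software model. The sketch below is not the paper's hardware design: the class `DependenceTracker`, its method names, and the grouping rule are all assumptions made for illustration. It assumes that every data transfer from one processor to another is recorded as a producer-to-consumer edge, that a checkpoint must include the initiator's transitive producers, and that a rollback must include the faulty processor's transitive consumers.

```python
from collections import defaultdict, deque


class DependenceTracker:
    """Toy model of inter-thread dependence tracking (hypothetical, not Rebound's hardware)."""

    def __init__(self):
        # consumer -> set of processors whose data it has read since the last checkpoint
        self.producers = defaultdict(set)
        # producer -> set of processors that have read its data since the last checkpoint
        self.consumers = defaultdict(set)

    def record_transfer(self, producer, consumer):
        """Model one coherence transaction that moves data from `producer` to `consumer`."""
        if producer != consumer:
            self.producers[consumer].add(producer)
            self.consumers[producer].add(consumer)

    @staticmethod
    def _closure(start, edges):
        """Breadth-first transitive closure of `start` over the given edge map."""
        seen = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def checkpoint_set(self, initiator):
        """Assumed rule: the initiator checkpoints together with its transitive producers."""
        return self._closure(initiator, self.producers)

    def rollback_set(self, faulty):
        """Assumed rule: the faulty processor rolls back together with its transitive consumers."""
        return self._closure(faulty, self.consumers)


# Example: processors 0 -> 1 -> 2 communicate; 3 -> 4 form an unrelated group.
tracker = DependenceTracker()
tracker.record_transfer(0, 1)
tracker.record_transfer(1, 2)
tracker.record_transfer(3, 4)

group_for_2 = tracker.checkpoint_set(2)   # {0, 1, 2}
group_for_0 = tracker.rollback_set(0)     # {0, 1, 2}
group_for_4 = tracker.checkpoint_set(4)   # {3, 4}
```

In this model, processors 3 and 4 never take part in a checkpoint or rollback started by processors 0, 1, or 2, which is the property that lets a local scheme avoid global operations when communication is confined to small groups.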

      Published In

      ISCA '11: Proceedings of the 38th annual international symposium on Computer architecture
      June 2011
      488 pages
      ISBN: 9781450304726
      DOI: 10.1145/2000064
      • ACM SIGARCH Computer Architecture News, Volume 39, Issue 3
        ISCA '11
        June 2011
        462 pages
        ISSN: 0163-5964
        DOI: 10.1145/2024723
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. faults
      2. scalable checkpointing
      3. shared-memory multiprocessors

      Qualifiers

      • Research-article

      Conference

      ISCA '11
      Acceptance Rates

      Overall Acceptance Rate 543 of 3,203 submissions, 17%

      Article Metrics

      • Downloads (last 12 months): 11
      • Downloads (last 6 weeks): 6
      Reflects downloads up to 17 Feb 2025

      Cited By

      • (2020) ACR: Amnesic Checkpointing and Recovery. 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 30-43. DOI: 10.1109/HPCA47549.2020.00013. Online publication date: Feb-2020
      • (2017) Classification of Resilience Techniques Against Functional Errors at Higher Abstraction Layers of Digital Systems. ACM Computing Surveys 50:4, pp. 1-38. DOI: 10.1145/3092699. Online publication date: 4-Oct-2017
      • (2016) Architectural Support for Fault Tolerance in a Teradevice Dataflow System. International Journal of Parallel Programming 44:2, pp. 208-232. DOI: 10.1007/s10766-014-0312-y. Online publication date: 1-Apr-2016
      • (2015) Lowering Minimum Supply Voltage for Power-Efficient Cache Design by Exploiting Data Redundancy. ACM Transactions on Design Automation of Electronic Systems 21:1, pp. 1-24. DOI: 10.1145/2795229. Online publication date: 2-Dec-2015
      • (2015) Differentiated Failure Remediation with Action Selection for Resilient Computing. Proceedings of the 2015 IEEE 21st Pacific Rim International Symposium on Dependable Computing (PRDC), pp. 199-208. DOI: 10.1109/PRDC.2015.42. Online publication date: 18-Nov-2015
      • (2015) Toward efficient check-pointing and rollback under on-demand SBST in chip multi-processors. 2015 IEEE 21st International On-Line Testing Symposium (IOLTS), pp. 110-115. DOI: 10.1109/IOLTS.2015.7229842. Online publication date: Jul-2015
      • (2015) Soft-error mitigation by means of decoupled transactional memory threads. Distributed Computing 28:2, pp. 75-90. DOI: 10.1007/s00446-014-0215-6. Online publication date: 1-Apr-2015
      • (2015) Transactional Memory for Reliability. Transactional Memory. Foundations, Algorithms, Tools, and Applications, pp. 268-282. DOI: 10.1007/978-3-319-14720-8_13. Online publication date: 2015
      • (2014) Globally precise-restartable execution of parallel programs. ACM SIGPLAN Notices 49:6, pp. 181-192. DOI: 10.1145/2666356.2594306. Online publication date: 9-Jun-2014
      • (2014) Globally precise-restartable execution of parallel programs. Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 181-192. DOI: 10.1145/2594291.2594306. Online publication date: 9-Jun-2014
