Rebound: scalable checkpointing for coherent shared memory

Authors:

Josep Torrellas

ISCA '11: Proceedings of the 38th annual international symposium on Computer architecture

Pages 153 - 164

https://doi.org/10.1145/2000064.2000083

Published: 04 June 2011

Abstract

As we move to large manycores, the hardware-based global checkpointing schemes that have been proposed for small shared-memory machines do not scale. Scalability barriers include global operations, work lost to global rollback, and inefficiencies in imbalanced or I/O-intensive loads. Scalable checkpointing requires tracking inter-thread dependences and building the checkpoint and rollback operations around dynamic groups of communicating processors.

To address this problem, this paper introduces Rebound, the first hardware-based scheme for coordinated local checkpointing in multiprocessors with directory-based cache coherence. Rebound leverages the transactions of a directory protocol to track inter-thread dependences. In addition, it boosts checkpointing efficiency by: (i) delaying the writeback of data to safe memory at checkpoints, (ii) supporting operation with multiple checkpoints, and (iii) optimizing checkpointing at barrier synchronization. Finally, Rebound introduces distributed algorithms for checkpointing and rolling back sets of processors. Simulations of parallel programs with up to 64 threads show that Rebound is scalable and has very low overhead. For 64 processors, its average performance overhead is only 2%, compared to 15% for global checkpointing.
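The idea of checkpointing only a dynamic group of communicating processors can be illustrated with a small software model. The sketch below is not the hardware mechanism or the distributed algorithm from the paper. It is a minimal illustration under stated assumptions: dependences are treated as symmetric (a conservative simplification), and the names `DependenceTracker`, `record_transfer`, `interaction_set`, and `checkpoint` are hypothetical. It computes the set of processors that have communicated, directly or transitively, with an initiator since the last checkpoint, and checkpoints only that set.

```python
from collections import defaultdict, deque


class DependenceTracker:
    """Toy model of dependences recorded since the last checkpoint.

    Assumption: dependences are stored as undirected edges, which is a
    conservative simplification chosen for this illustration only.
    """

    def __init__(self):
        # processor id -> set of processors it has exchanged data with
        self.neighbors = defaultdict(set)

    def record_transfer(self, producer, consumer):
        # Hypothetical hook: imagine it being called when a coherence
        # transaction moves data written by `producer` to `consumer`.
        if producer != consumer:
            self.neighbors[producer].add(consumer)
            self.neighbors[consumer].add(producer)

    def interaction_set(self, initiator):
        # Breadth-first search: transitive closure of recorded dependences.
        seen = {initiator}
        queue = deque([initiator])
        while queue:
            p = queue.popleft()
            for q in self.neighbors[p]:
                if q not in seen:
                    seen.add(q)
                    queue.append(q)
        return seen

    def checkpoint(self, initiator):
        # Checkpoint only the communicating group, then clear its
        # dependence state so that tracking starts afresh.
        group = self.interaction_set(initiator)
        for p in group:
            self.neighbors.pop(p, None)
        return group


tracker = DependenceTracker()
tracker.record_transfer(0, 1)
tracker.record_transfer(1, 2)
tracker.record_transfer(5, 6)
print(sorted(tracker.checkpoint(1)))        # [0, 1, 2]
print(sorted(tracker.interaction_set(5)))   # [5, 6]
```

In this model, processors 5 and 6 are unaffected by the checkpoint initiated at processor 1, which is the property that lets a local scheme avoid global operations.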

Index Terms

Rebound: scalable checkpointing for coherent shared memory
1. Hardware
  1. Hardware test
  2. Robustness

Published In

ISCA '11: Proceedings of the 38th annual international symposium on Computer architecture

June 2011

488 pages

ISBN:9781450304726

DOI:10.1145/2000064

General Chairs: Ravi Iyer (Intel), Qing Yang (University of Rhode Island)

Program Chair: Antonio González (Intel and UPC)

ACM SIGARCH Computer Architecture News Volume 39, Issue 3
ISCA '11
June 2011
462 pages
ISSN:0163-5964
DOI:10.1145/2024723

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ISCA '11

Sponsor:

SIGARCH

ISCA '11: The 38th Annual International Symposium on Computer Architecture

June 4 - 8, 2011

San Jose, California, USA

Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%
