DOI: 10.1145/2228360.2228585
research-article

Specification and synthesis of hardware checkpointing and rollback mechanisms

Published: 03 June 2012

Abstract

The increasing pressure to make hardware resilient to runtime failures has prompted the development of design techniques for specific classes of systems, e.g., processors and routers. However, these techniques increase design and verification costs, which limits their broader application. In this work, we describe a methodology for general RTL designs based on the widely applicable checkpointing and rollback resiliency mechanism. We take a modeling and language approach that provides an appropriate set of abstractions for the resiliency logic. This approach cleanly separates the main design behavior from the resiliency behavior, which eases design. Further, because the language abstractions can be automatically synthesized into resiliency logic, our methodology can be integrated with existing design flows. Concerns about verifying this additional resiliency logic can be addressed by synthesizing behavioral assertions that capture its correct behavior. We demonstrate the methodology on four examples, synthesizing each for performance and area to estimate the overhead of the additional synthesized logic.



Published In

DAC '12: Proceedings of the 49th Annual Design Automation Conference
June 2012
1357 pages
ISBN:9781450311991
DOI:10.1145/2228360

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. CpR-verilog
  2. backward error recovery

Conference

DAC '12: The 49th Annual Design Automation Conference 2012
June 3 - 7, 2012
San Francisco, California

Acceptance Rates

Overall Acceptance Rate 1,770 of 5,499 submissions, 32%



Cited By

  • (2019) Fault Tolerance in VLSI Circuit with Reducing Rollback Cost using FSM. 2019 3rd International Conference on Computing Methodologies and Communication (ICCMC), pp. 1162-1166. DOI: 10.1109/ICCMC.2019.8819739. Online publication date: Mar-2019
  • (2018) Reducing Rollback Cost in VLSI Circuits to Improve Fault Tolerance. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 26:8, pp. 1438-1451. DOI: 10.1109/TVLSI.2018.2818021. Online publication date: Aug-2018
  • (2015) Formal verification of automatic circuit transformations for fault-tolerance. Proceedings of the 15th Conference on Formal Methods in Computer-Aided Design, pp. 41-48. DOI: 10.5555/2893529.2893542. Online publication date: 27-Sep-2015
  • (2015) Automatic Time-Redundancy Transformation for Fault-Tolerant Circuits. Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 218-227. DOI: 10.1145/2684746.2689058. Online publication date: 22-Feb-2015
  • (2015) Formal verification of automatic circuit transformations for fault-tolerance. 2015 Formal Methods in Computer-Aided Design (FMCAD), pp. 41-48. DOI: 10.1109/FMCAD.2015.7542251. Online publication date: Sep-2015
  • (2015) Time-redundancy transformations for adaptive fault-tolerant circuits. 2015 NASA/ESA Conference on Adaptive Hardware and Systems (AHS), pp. 1-8. DOI: 10.1109/AHS.2015.7231164. Online publication date: Jun-2015
