A low-power instruction replay mechanism for design of resilient microprocessors

Published: 10 March 2014

Abstract

There is growing concern about the increasing rate of defects in computing substrates. Traditional redundancy solutions are too expensive for commodity microprocessor systems. Modern microprocessors feature multiple execution units to exploit instruction-level parallelism, but most workloads do not exhibit the level of instruction-level parallelism that a typical microprocessor is provisioned for. This offers an opportunity to reexecute instructions on idle execution units. However, relying solely on idle resources does not provide full instruction coverage, so other alternatives must be explored. To that end, we propose and evaluate two instruction replay schemes that operate within the same core for online testing of the execution units. One scheme (RER) reexecutes only the retired instructions, while the other (REI) reexecutes all the issued instructions. The complete proposed solution requires a comparator and minor modifications to the control logic, resulting in negligible hardware overhead. Both soft and hard error detection are considered, and the performance and energy impact of both schemes are evaluated and compared against previously proposed redundant execution schemes. Results show that although the proposed schemes incur a small performance penalty compared with previous work, they significantly reduce the energy overhead.
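The replay-and-compare idea in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's design: `ExecUnit`, `stuck_bit`, and `replay_and_compare` are hypothetical names, the fault model is a single stuck-at-1 output bit, and the replay is assumed to run on a different unit than the original execution so that a hard fault in one unit can show up as a mismatch.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

MASK32 = 0xFFFFFFFF  # model 32-bit results


@dataclass
class ExecUnit:
    """Hypothetical integer execution unit.

    stuck_bit, if set, models a hard fault: that output bit is stuck at 1.
    """
    name: str
    stuck_bit: Optional[int] = None

    def execute(self, op: Callable[[int, int], int], a: int, b: int) -> int:
        result = op(a, b) & MASK32
        if self.stuck_bit is not None:
            result |= 1 << self.stuck_bit  # force the faulty output bit to 1
        return result


def replay_and_compare(
    units: List[ExecUnit],
    primary: int,
    op: Callable[[int, int], int],
    a: int,
    b: int,
) -> Tuple[int, bool]:
    """Execute on units[primary], replay on the next unit, compare results.

    Returns (primary_result, mismatch). Choosing the next unit for the
    replay is an assumption of this sketch.
    """
    first = units[primary].execute(op, a, b)
    checker = units[(primary + 1) % len(units)]
    second = checker.execute(op, a, b)
    return first, first != second


def add(a: int, b: int) -> int:
    return a + b


if __name__ == "__main__":
    units = [ExecUnit("alu0"), ExecUnit("alu1", stuck_bit=3)]
    # The fault corrupts the replayed result, so the comparator flags it.
    print(replay_and_compare(units, 0, add, 1, 2))
    # Bit 3 is already 1 in the correct result, so the fault is masked.
    print(replay_and_compare(units, 0, add, 8, 0))
```

The second call shows why a single replay does not guarantee detection: a stuck-at fault is only visible when the operands produce a result that disagrees with the stuck value, which is one reason instruction coverage matters.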


Cited By

  • (2017) NEDA: NOP Exploitation with Dependency Awareness for Reliable VLIW Processors. 2017 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 391-396. DOI: 10.1109/ISVLSI.2017.75. Online publication date: Jul-2017
  • (2015) REPAIR: Hard-error recovery via re-execution. 2015 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFTS), 76-79. DOI: 10.1109/DFT.2015.7315139. Online publication date: Oct-2015

Published In

ACM Transactions on Embedded Computing Systems, Volume 13, Issue 4
Regular Papers
November 2014
647 pages
ISSN:1539-9087
EISSN:1558-3465
DOI:10.1145/2592905

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 March 2014
Accepted: 01 August 2013
Revised: 01 June 2013
Received: 01 December 2012
Published in TECS Volume 13, Issue 4

Author Tags

  1. Inter Core Queue (ICQ)
  2. Online error detection
  3. Out of Order Reliable Superscalar (O3RS)
  4. Re-execute on Issue (REI)
  5. Re-execute on Retire (RER)
  6. SHared REsource Checker (SHREC)
  7. dual use of superscalar datapath (DUAL)
  8. energy and performance efficient test
  9. functional unit test
  10. out-of-order (OOO)
  11. self imposed redundancy (SELF)
  12. using underutilized CPU resources for reliability (UCPU)

Qualifiers

  • Research-article
  • Research
  • Refereed
