A low-power instruction replay mechanism for design of resilient microprocessors

Published: 10 March 2014

Abstract

There is growing concern about the increasing rate of defects in computing substrates. Traditional redundancy solutions are too expensive for commodity microprocessor systems. Modern microprocessors feature multiple execution units to exploit instruction-level parallelism, yet most workloads do not exhibit the level of instruction-level parallelism that a typical microprocessor is provisioned for. This offers an opportunity to reexecute instructions on idle execution units. However, relying solely on idle resources does not provide full instruction coverage, so other alternatives need to be explored. To that end, we propose and evaluate two instruction replay schemes that operate within the same core for online testing of the execution units. One scheme (RER) reexecutes only the retired instructions, while the other (REI) reexecutes all issued instructions. The complete proposed solution requires a comparator and minor modifications to the control logic, resulting in negligible hardware overhead. Both soft and hard error detection are considered, and the performance and energy impact of both schemes is evaluated and compared against previously proposed redundant execution schemes. Results show that although the proposed schemes incur a small performance penalty relative to previous work, the energy overhead is significantly reduced.
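
The difference between the two schemes can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the hardware design evaluated in the article: the names `Instr`, `make_unit`, and `replay_check` are hypothetical, the fault is modeled as a single stuck-at-1 result bit, and the replay is assumed to run on a second, fault-free unit. Every issued instruction executes once; RER replays only instructions that retire, while REI replays every issued instruction, including squashed ones. A comparator flags any mismatch between the first result and the replayed result.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Instr:
    op: str
    a: int
    b: int
    retired: bool  # False models an issued instruction that was later squashed


# Hypothetical three-operation ALU used only for this illustration.
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "and": lambda a, b: a & b,
}


def make_unit(stuck_at_one_bit: Optional[int] = None) -> Callable[[Instr], int]:
    """Return a 32-bit execution unit, optionally with one result bit stuck at 1."""
    def unit(instr: Instr) -> int:
        result = OPS[instr.op](instr.a, instr.b) & 0xFFFFFFFF
        if stuck_at_one_bit is not None:
            result |= 1 << stuck_at_one_bit  # assumed hard-fault model
        return result
    return unit


def replay_check(
    trace: List[Instr],
    primary: Callable[[Instr], int],
    checker: Callable[[Instr], int],
    scheme: str,
) -> Tuple[int, List[int]]:
    """Return (number of replays, indices of instructions whose replay mismatched)."""
    replays = 0
    mismatches = []
    for i, instr in enumerate(trace):
        first = primary(instr)  # every issued instruction executes once
        if scheme == "RER" and not instr.retired:
            continue  # RER skips instructions that never retire
        replays += 1
        if checker(instr) != first:  # the comparator
            mismatches.append(i)
    return replays, mismatches


good = make_unit()
faulty = make_unit(stuck_at_one_bit=0)
trace = [
    Instr("add", 1, 2, True),
    Instr("sub", 9, 5, False),  # squashed: replayed by REI only
    Instr("and", 6, 3, True),
    Instr("add", 8, 8, True),
]

print(replay_check(trace, faulty, good, "RER"))  # (3, [2, 3])
print(replay_check(trace, faulty, good, "REI"))  # (4, [1, 2, 3])
```

In this toy trace REI performs one more replay than RER because it also reexecutes the squashed instruction. The first instruction goes undetected under both schemes because its correct result already has bit 0 set, which masks the stuck-at-1 fault for that operand pair.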
