DOI: 10.1145/1555754.1555769
research-article

Architectural core salvaging in a multi-core processor for hard-error tolerance

Published: 20 June 2009

ABSTRACT

The incidence of hard errors in CPUs is a challenge for future multicore designs due to increasing total core area. Even if the location and nature of hard errors are known a priori, either at manufacture-time or in the field, cores with such errors must be disabled in the absence of hard-error tolerance. While caches, with their regular and repetitive structures, are easily covered against hard errors by providing spare arrays or spare lines, structures within a core are neither as regular nor as repetitive. Previous work has proposed microarchitectural core salvaging to exploit structural redundancy within a core and maintain functionality in the presence of hard errors. Unfortunately, microarchitectural salvaging introduces complexity and may provide only limited coverage of core area against hard errors due to a lack of natural redundancy in the core.

This paper makes a case for architectural core salvaging. We observe that even if some individual cores cannot execute certain operations, a CPU die can be instruction-set-architecture (ISA) compliant, that is, execute all of the instructions required by its ISA, by exploiting natural cross-core redundancy. We propose using hardware to migrate offending threads to another core that can execute the operation. Architectural core salvaging can cover a large core area against faults, and be implemented by leveraging known techniques that minimize changes to the microarchitecture. We show it is possible to optimize architectural core salvaging such that the performance on a faulty die approaches that of a fault-free die, assuring significantly better performance than core disabling for many workloads and no worse performance than core disabling for the remainder.
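The migration idea in the abstract can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the paper's hardware mechanism: the `Core` class, the operation-class names (`int_add`, `fp_div`, `simd`), and the "first capable core" selection policy are all hypothetical, and migration cost is ignored.

```python
from dataclasses import dataclass


@dataclass
class Core:
    """Hypothetical model of a core with known hard faults."""
    core_id: int
    broken_ops: frozenset = frozenset()  # operation classes this core cannot execute

    def can_execute(self, op: str) -> bool:
        return op not in self.broken_ops


def die_is_isa_compliant(cores, isa_ops):
    """A die is ISA compliant if every operation runs on at least one core."""
    return all(any(c.can_execute(op) for c in cores) for op in isa_ops)


def run_thread(ops, cores, start_core=0):
    """Execute ops in order, migrating when the current core cannot run one.

    Returns (trace, migrations), where trace lists (op, core_id) pairs.
    The target policy (first capable core) is an assumption for illustration.
    """
    current = cores[start_core]
    migrations = 0
    trace = []
    for op in ops:
        if not current.can_execute(op):
            target = next((c for c in cores if c.can_execute(op)), None)
            if target is None:
                raise RuntimeError(f"no core on this die can execute {op!r}")
            current = target
            migrations += 1
        trace.append((op, current.core_id))
    return trace, migrations


if __name__ == "__main__":
    # Each core has a different fault, so the die as a whole still covers the ISA.
    cores = [Core(0, frozenset({"fp_div"})), Core(1, frozenset({"simd"}))]
    print(die_is_isa_compliant(cores, {"int_add", "fp_div", "simd"}))
    print(run_thread(["int_add", "fp_div", "int_add", "simd"], cores))
```

In this model a thread stays on its new core after migrating, so a workload that rarely uses the faulty operation class migrates rarely. That is the intuition behind the claim that a faulty die can approach fault-free performance for many workloads.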


        Published in

          ISCA '09: Proceedings of the 36th annual international symposium on Computer architecture
          June 2009
          510 pages
          ISBN: 9781605585260
          DOI: 10.1145/1555754

          ACM SIGARCH Computer Architecture News, Volume 37, Issue 3
          June 2009
          495 pages
          ISSN: 0163-5964
          DOI: 10.1145/1555815

          Copyright © 2009 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Acceptance Rates

          Overall Acceptance Rate: 543 of 3,203 submissions, 17%
