ABSTRACT
The incidence of hard errors in CPUs is a challenge for future multicore designs due to increasing total core area. Even if the location and nature of hard errors are known a priori, either at manufacture time or in the field, cores with such errors must be disabled in the absence of hard-error tolerance. While caches, with their regular and repetitive structures, are easily covered against hard errors by providing spare arrays or spare lines, structures within a core are neither as regular nor as repetitive. Previous work has proposed microarchitectural core salvaging to exploit structural redundancy within a core and maintain functionality in the presence of hard errors. Unfortunately, microarchitectural salvaging introduces complexity and may provide only limited coverage of core area against hard errors due to a lack of natural redundancy in the core.
This paper makes a case for architectural core salvaging. We observe that even if some individual cores cannot execute certain operations, a CPU die can be instruction-set-architecture (ISA) compliant, that is, able to execute all of the instructions required by its ISA, by exploiting natural cross-core redundancy. We propose using hardware to migrate offending threads to another core that can execute the operation. Architectural core salvaging can cover a large core area against faults and can be implemented by leveraging known techniques that minimize changes to the microarchitecture. We show it is possible to optimize architectural core salvaging such that the performance on a faulty die approaches that of a fault-free die, assuring significantly better performance than core disabling for many workloads and no worse performance than core disabling for the remainder.
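The idea of die-level ISA compliance with thread migration can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware mechanism the paper proposes: the instruction classes in ISA, the Core type, the functions die_is_isa_compliant and run_thread, and the "first capable core" migration policy are all hypothetical names and choices introduced here for illustration.

```python
from dataclasses import dataclass

# Hypothetical instruction classes; the abstract does not enumerate any.
ISA = frozenset({"int_alu", "fp_mul", "fp_div", "simd"})


@dataclass
class Core:
    core_id: int
    supported: frozenset  # instruction classes this core can still execute


def die_is_isa_compliant(cores, isa=ISA):
    """Return True if every instruction class runs on at least one core."""
    covered = set()
    for core in cores:
        covered |= core.supported
    return isa <= covered


def run_thread(ops, cores, start_core=0):
    """Execute ops in order, migrating when the current core lacks an op.

    Returns (trace, migrations), where trace lists (op, core_id) pairs.
    """
    current = cores[start_core]
    trace = []
    migrations = 0
    for op in ops:
        if op not in current.supported:
            # Toy policy (an assumption): move to the first capable core.
            target = next((c for c in cores if op in c.supported), None)
            if target is None:
                raise RuntimeError(f"no core can execute {op!r}")
            current = target
            migrations += 1
        trace.append((op, current.core_id))
    return trace, migrations


cores = [
    Core(0, ISA - {"fp_div"}),  # assumed hard fault in the FP divider
    Core(1, ISA),               # fault-free core
]
trace, migrations = run_thread(["int_alu", "fp_div", "int_alu"], cores)
```

In this model the die as a whole remains ISA compliant even though core 0 alone is not. The sketch leaves the thread on the destination core after a migration; when and whether to migrate back is a policy choice it does not model.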