DOI: 10.1145/2380445.2380461
research-article

A novel NoC-based design for fault-tolerance of last-level caches in CMPs

Published: 07 October 2012

Abstract

Advances in technology scaling, coupled with aggressive voltage scaling, result in significant reliability challenges for emerging Chip Multiprocessor (CMP) platforms, where error-prone caches continue to dominate the chip area. Network-on-Chip (NoC) fabrics are increasingly used to manage the scalability of these CMPs. We present a novel fault-tolerant scheme for the Last Level Cache (LLC) in CMP architectures that leverages the interconnection network to protect the LLC banks against permanent faults. During an LLC access to a faulty area, the network detects and corrects the faults and returns fault-free data to the requesting core. By leveraging the NoC interconnection fabric, we can implement any cache fault-tolerant scheme in an efficient, modular, and scalable manner. We perform extensive design space exploration on NoC benchmarks to demonstrate the utility and efficacy of our approach. The overheads of leveraging the NoC fabric are minimal: on an 8-core, 16-cache-bank CMP we demonstrate reliable access to LLCs with additional overheads of less than 3% in area and less than 7% in power.
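
The abstract does not spell out the mechanism, so the following is only a toy sketch of the general idea, assuming a remapping-style scheme (suggested by the "remapping" author tag). A network-side wrapper consults a fault map and redirects accesses that target a faulty line to a spare location in another bank. The names `CacheBank`, `FaultTolerantInterface`, and the remap-table layout are hypothetical and are not taken from the paper.

```python
class CacheBank:
    """Toy LLC bank. Lines listed in `faulty_lines` have permanent faults."""

    def __init__(self, faulty_lines=()):
        self.data = {}
        self.faulty = set(faulty_lines)

    def write(self, line, value):
        # Model a permanent fault: a write to a faulty line is lost.
        if line not in self.faulty:
            self.data[line] = value

    def read(self, line):
        # Returns None when nothing valid is stored at this line.
        return self.data.get(line)


class FaultTolerantInterface:
    """Hypothetical network-side wrapper (an assumption, not the paper's design).

    `remap` maps a faulty (bank, line) pair to a healthy (bank, line) pair.
    Accesses to locations that are not in the table pass through unchanged.
    """

    def __init__(self, banks, remap):
        self.banks = banks
        self.remap = dict(remap)

    def _resolve(self, bank, line):
        # Redirect only if the target location is known to be faulty.
        return self.remap.get((bank, line), (bank, line))

    def write(self, bank, line, value):
        b, l = self._resolve(bank, line)
        self.banks[b].write(l, value)

    def read(self, bank, line):
        b, l = self._resolve(bank, line)
        return self.banks[b].read(l)


# Bank 0 has a permanent fault at line 3; bank 1 donates line 7 as a spare.
banks = [CacheBank(faulty_lines={3}), CacheBank()]
ni = FaultTolerantInterface(banks, {(0, 3): (1, 7)})

ni.write(0, 3, "payload")   # redirected to bank 1, line 7
ni.write(0, 4, "other")     # healthy line, no redirection
```

In this sketch the requesting core always addresses the original location; the redirection is invisible to it, which mirrors the abstract's claim that fault handling happens in the network rather than in the cache banks themselves.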

Information & Contributors

Information

Published In

CODES+ISSS '12: Proceedings of the eighth IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
October 2012
596 pages
ISBN:9781450314268
DOI:10.1145/2380445

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. chip multiprocessor
  2. fault-tolerant cache
  3. network-on-chip
  4. reliability
  5. remapping

Qualifiers

  • Research-article

Conference

ESWEEK'12
ESWEEK'12: Eighth Embedded System Week
October 7 - 12, 2012
Tampere, Finland

Acceptance Rates

CODES+ISSS '12 Paper Acceptance Rate 48 of 163 submissions, 29%;
Overall Acceptance Rate 280 of 864 submissions, 32%

Cited By

  • (2017) Leveraging on Deep Memory Hierarchies to Minimize Energy Consumption and Data Access Latency on Single-Chip Cloud Computers. IEEE Transactions on Sustainable Computing, 2:2 (154-166). DOI: 10.1109/TSUSC.2017.2706620. Online publication date: 1-Apr-2017
  • (2014) A GALS Router for Asynchronous Network-on-Chip. Proceedings of International Workshop on Manycore Embedded Systems (52-55). DOI: 10.1145/2613908.2613918. Online publication date: 15-Jun-2014
  • (2014) Multi-Layer Memory Resiliency. Proceedings of the 51st Annual Design Automation Conference (1-6). DOI: 10.1145/2593069.2596684. Online publication date: 1-Jun-2014
  • (2014) NoC-based fault-tolerant cache design in chip multiprocessors. ACM Transactions on Embedded Computing Systems, 13:3s (1-26). DOI: 10.1145/2567939. Online publication date: 28-Mar-2014
  • (2013) REMEDIATE: A scalable fault-tolerant architecture for low-power NUCA cache in tiled CMPs. 2013 International Green Computing Conference Proceedings (1-10). DOI: 10.1109/IGCC.2013.6604500. Online publication date: Jun-2013
