DOI: 10.1145/2155620.2155627
Research article

A systematic methodology to develop resilient cache coherence protocols

Published: 03 December 2011

Abstract

Aggressive transistor scaling continues to increase integration capacity with each new technology node, but technology downscaling also increases the vulnerability of semiconductor devices and causes silicon failures. Thus, fault-tolerant architectures are emerging to guarantee reliable functionality on unreliable silicon. While tolerating faults within a processor core has been extensively researched, the many-core era introduces the challenge of reliable on-chip communication in Chip Multi-Processors (CMPs). In CMP systems, an unreliable interconnection network can lose or corrupt coherence messages, causing the entire chip to deadlock. In this work, we argue for a system-level resiliency solution to tolerate an unreliable underlying Network-on-Chip (NoC). We introduce a systematic methodology to transform a coherence protocol into a resilient one by extending its Finite State Machine (FSM) with safe states and incorporating additional handshaking messages into transactions. The modified protocol ensures coherent and reliable transactions over any lossy NoC. Our approach is generic and can be applied to a wide range of protocols. It requires minimal hardware modifications and introduces only a slight performance overhead (an average of 0.8% during fault-free operation, and 1.9% even at an aggressive fault rate of one fault per millisecond).
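The general idea of a handshake-protected transaction over a lossy network can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the protocol from the paper: the abstract does not spell out the safe states or handshake messages, so the state names (`IS_WAIT`, `SAFE_WAIT_ACK`), the message names (`GetS`, `Data`, `Ack`), the timeout-driven retransmission, and the `LossyChannel` and `run_read_transaction` helpers are all hypothetical. It shows a single read transaction between one cache and one directory, where each side waits in a state that retransmits on timeout until the handshake completes.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class LossyChannel:
    """In-order channel that silently drops the sends whose index is in `drop`."""
    drop: set = field(default_factory=set)
    sent: int = 0
    queue: deque = field(default_factory=deque)

    def send(self, msg):
        if self.sent not in self.drop:
            self.queue.append(msg)
        self.sent += 1

    def recv(self):
        return self.queue.popleft() if self.queue else None


def run_read_transaction(drop_to_dir=(), drop_to_cache=(), timeout=3, max_ticks=100):
    """Simulate one read; return (ticks elapsed, total messages sent)."""
    to_dir = LossyChannel(set(drop_to_dir))
    to_cache = LossyChannel(set(drop_to_cache))
    cache_state, dir_state = "I", "IDLE"
    cache_timer = dir_timer = 0

    # The cache issues the request and enters a transient waiting state.
    to_dir.send("GetS")
    cache_state = "IS_WAIT"

    for tick in range(max_ticks):
        # Directory side: answer requests, then hold a (hypothetical) safe
        # state until the requester acknowledges the data.
        msg = to_dir.recv()
        if msg == "GetS":
            to_cache.send("Data")
            dir_state = "SAFE_WAIT_ACK"
            dir_timer = 0
        elif msg == "Ack":
            dir_state = "IDLE"
        elif dir_state == "SAFE_WAIT_ACK":
            dir_timer += 1
            if dir_timer >= timeout:  # the Data or the Ack may have been lost
                to_cache.send("Data")
                dir_timer = 0

        # Cache side: accept data (duplicates are simply re-acknowledged),
        # otherwise retransmit the request after a timeout.
        msg = to_cache.recv()
        if msg == "Data":
            cache_state = "S"
            to_dir.send("Ack")
        elif cache_state == "IS_WAIT":
            cache_timer += 1
            if cache_timer >= timeout:  # the request or the Data may have been lost
                to_dir.send("GetS")
                cache_timer = 0

        if cache_state == "S" and dir_state == "IDLE":
            return tick + 1, to_dir.sent + to_cache.sent

    raise RuntimeError("transaction did not complete")


if __name__ == "__main__":
    print(run_read_transaction())                   # fault-free run
    print(run_read_transaction(drop_to_dir={0}))    # first request lost
    print(run_read_transaction(drop_to_cache={0}))  # first data reply lost
```

Without the timers, any single dropped message in this toy model would leave one side waiting forever, which mirrors the deadlock risk the abstract describes. With them, each loss costs a few extra ticks and one or two extra messages.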

Published In

MICRO-44: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
December 2011
519 pages
ISBN: 9781450310536
DOI: 10.1145/2155620
  • Conference Chair: Carlo Galuzzi
  • General Chair: Luigi Carro
  • Program Chairs: Andreas Moshovos, Milos Prvulovic

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. coherence protocol
  2. fault tolerance
  3. resilience

Qualifiers

  • Research-article

Conference

MICRO-44
Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Cited By

  • (2023) Āpta: Fault-tolerant object-granular CXL disaggregated memory for accelerating FaaS. 2023 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 201-215. DOI: 10.1109/DSN58367.2023.00030. Online publication date: Jun-2023
  • (2021) PRISM. ACM Transactions on Architecture and Code Optimization 18:3, pp. 1-25. DOI: 10.1145/3450523. Online publication date: 8-Jun-2021
  • (2018) Declarative Resilience. ACM Transactions on Embedded Computing Systems 17:4, pp. 1-27. DOI: 10.1145/3210559. Online publication date: 24-Jul-2018
  • (2018) System States Transition Safety Analysis Method Based on FSM and NuSMV. Proceedings of the 2018 2nd International Conference on Management Engineering, Software Engineering and Service Sciences, pp. 107-112. DOI: 10.1145/3180374.3181346. Online publication date: 13-Jan-2018
  • (2017) Detecting Software Cache Coherence Violations in MPSoC Using Traces Captured on Virtual Platforms. ACM Transactions on Embedded Computing Systems 16:2, pp. 1-21. DOI: 10.1145/2990193. Online publication date: 2-Jan-2017
  • (2016) Resource Conscious Diagnosis and Reconfiguration for NoC Permanent Faults. IEEE Transactions on Computers 65:7, pp. 2241-2256. DOI: 10.1109/TC.2015.2479586. Online publication date: 1-Jul-2016
  • (2014) Revisiting the Complexity of Hardware Cache Coherence and Some Implications. ACM Transactions on Architecture and Code Optimization 11:4, pp. 1-22. DOI: 10.1145/2663345. Online publication date: 8-Dec-2014
  • (2014) ForEVeR. ACM Transactions on Embedded Computing Systems 13:3s, pp. 1-30. DOI: 10.1145/2514871. Online publication date: 28-Mar-2014
  • (2013) uDIREC. Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 148-159. DOI: 10.1145/2540708.2540722. Online publication date: 7-Dec-2013
  • (2013) Toward a Coherent Multicore Memory Model. Computer 46:10, pp. 30-31. DOI: 10.1109/MC.2013.373. Online publication date: 1-Oct-2013
