research-article

Improving coherence protocol reactiveness by trading bandwidth for latency

Authors:

Lucia G. Menezo,

Valentin Puente,

Jose Angel Gregorio

CF '12: Proceedings of the 9th conference on Computing Frontiers

Pages 143 - 152

https://doi.org/10.1145/2212908.2212929

Published: 15 May 2012

Abstract

This paper describes how the particular characteristics of on-chip networks can be used to improve coherence protocol responsiveness. To this end, a new coherence protocol, named LOCKE, is proposed. LOCKE exploits the large bandwidth available on chip to improve the performance and energy efficiency of cache-coherent chip multiprocessors (CMPs). Provided that the interconnection network is designed to support multicast traffic and the protocol maximizes the potential advantages of direct coherence, we demonstrate that a multicast-based coherence protocol can reduce energy requirements in the CMP memory hierarchy. The key idea is to establish a suitable level of on-chip network throughput to accelerate synchronization by two means: avoiding the protocol serialization inherent to directory-based coherence protocols, and reducing average access time more than other snoop-based coherence protocols do when shared data is truly contended. LOCKE is built on top of a Token Coherence performance substrate, with a new set of simple proactive policies that speed up data synchronization and eliminate the passive token starvation avoidance mechanism. Evaluated with a full-system simulator that faithfully models the on-chip interconnection, an aggressive core architecture and precise memory hierarchy details, over a broad spectrum of workloads, our proposal improves on both directory-based and token-based coherence protocols in energy and performance, at least in systems with up to 16 aggressive out-of-order processors on the chip.
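
As background for the Token Coherence substrate mentioned in the abstract, the general token-counting rules can be sketched as follows: each block has a fixed number of tokens, a node may read while holding at least one token, and may write only while holding all of them. This is a minimal illustration of those general rules only, not LOCKE's policies; the class `TokenBlock`, the node names, and the one-shot `multicast_write_request` helper are hypothetical names introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBlock:
    """Token state for one cache block (illustrative only, not LOCKE itself)."""
    total_tokens: int
    # Tokens held per node; "mem" stands for memory, which starts with all tokens.
    held: dict = field(default_factory=dict)

    def __post_init__(self):
        if not self.held:
            self.held = {"mem": self.total_tokens}

    def can_read(self, node):
        # Reading requires at least one token.
        return self.held.get(node, 0) >= 1

    def can_write(self, node):
        # Writing requires every token, so no other node can hold a readable copy.
        return self.held.get(node, 0) == self.total_tokens

    def transfer(self, src, dst, count):
        # Tokens are moved between nodes, never created or destroyed.
        if count < 1 or self.held.get(src, 0) < count:
            raise ValueError("source does not hold that many tokens")
        self.held[src] -= count
        self.held[dst] = self.held.get(dst, 0) + count

    def multicast_write_request(self, requester):
        # Assumption: one request reaches every holder, and each holder
        # answers by handing over all of its tokens to the requester.
        for node in list(self.held):
            if node != requester and self.held[node] > 0:
                self.transfer(node, requester, self.held[node])


block = TokenBlock(total_tokens=4)
block.transfer("mem", "core0", 1)       # core0 obtains a readable copy
block.transfer("mem", "core1", 1)       # core1 obtains a readable copy
block.multicast_write_request("core1")  # core1 gathers every token
```

In this toy model the writer gathers tokens in a single round of responses instead of going through a directory lookup first, which is the kind of serialization the abstract says a multicast-based protocol can avoid; real protocols must also handle races between concurrent requesters, which this sketch ignores.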

Index Terms

1. Hardware
  1. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory

Published In

CF '12: Proceedings of the 9th conference on Computing Frontiers

May 2012

320 pages

ISBN:9781450312158

DOI:10.1145/2212908

General Chair: John Feo (Pacific Northwest National Laboratory, USA)
Program Chairs: Paolo Faraboschi (HP Labs, Spain), Oreste Villa (Pacific Northwest National Laboratory, USA)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

Sponsors

SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

CF'12

Sponsor:

SIGMICRO

CF'12: Computing Frontiers Conference

May 15 - 17, 2012

Cagliari, Italy

Acceptance Rates

Overall Acceptance Rate 273 of 785 submissions, 35%

Bibliometrics

Total Citations: 4
Total Downloads: 170
Downloads (Last 12 months): 3
Downloads (Last 6 weeks): 0

Reflects downloads up to 09 Feb 2025

Cited By

Menezo L, Puente V, Gregorio J (2020). Rainbow: A composable coherence protocol for multi‐chip servers. Concurrency and Computation: Practice and Experience 32:24. Online publication date: 21-Jul-2020.
https://doi.org/10.1002/cpe.5947
Menezo L, Puente V, Abad P, Gregorio J (2018). Mosaic. International Journal of Parallel Programming 46:6 (1110-1138). Online publication date: 1-Dec-2018.
https://dl.acm.org/doi/10.1007/s10766-018-0557-y
Menezo L, Puente V, Gregorio J, Fensch C, O'Boyle M, Seznec A, Bodin F (2013). The case for a scalable coherence protocol for complex on-chip cache hierarchies in many core systems. Proceedings of the 22nd international conference on Parallel architectures and compilation techniques (279-288). Online publication date: 7-Oct-2013.
https://dl.acm.org/doi/10.5555/2523721.2523760
Menezo L, Puente V, Gregorio J (2013). Vectorization past dependent branches through speculation. Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques (353-362). Online publication date: Oct-2013.
https://doi.org/10.1109/PACT.2013.6618824
