DOI: 10.1145/2212908.2212929
research-article

Improving coherence protocol reactiveness by trading bandwidth for latency

Published: 15 May 2012

Abstract

This paper describes how the particular characteristics of on-chip networks can be used to improve coherence protocol responsiveness. To this end, a new coherence protocol, named LOCKE, is proposed. LOCKE exploits the large bandwidth available on chip to improve the performance and energy efficiency of cache-coherent chip multiprocessors (CMPs). Provided that the interconnection network is designed to support multicast traffic and the protocol maximizes the potential advantages of direct coherence, we demonstrate that a multicast-based coherence protocol can reduce energy requirements in the CMP memory hierarchy. The key idea is to use a suitable level of on-chip network throughput to accelerate synchronization in two ways: by avoiding the protocol serialization inherent to directory-based coherence protocols, and by reducing average access time more than other snoop-based coherence protocols do when shared data is truly contended. LOCKE is developed on top of a Token Coherence performance substrate, with a new set of simple proactive policies that speed up data synchronization and eliminate the passive token starvation avoidance mechanism. Using a full-system simulator that faithfully models the on-chip interconnection, an aggressive core architecture, and precise memory hierarchy details, and running a broad spectrum of workloads, we show that our proposal improves on both directory-based and token-based coherence protocols in terms of both energy and performance, at least in systems with up to 16 aggressive out-of-order processors on the chip.
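As background for the Token Coherence substrate mentioned in the abstract, the following is a minimal sketch of the general token-counting rules described by Martin, Hill, and Wood: each block has a fixed number of tokens, a node needs at least one token to read, and it needs all of them to write. It does not model LOCKE's proactive policies or its multicast network. The names `Block`, `TOTAL_TOKENS`, and `demo`, and the 4-token configuration, are assumptions made only for illustration.

```python
from dataclasses import dataclass, field

TOTAL_TOKENS = 4  # hypothetical: fixed token count per block in this toy example


@dataclass
class Block:
    """Token holdings for one memory block; holders maps node name -> tokens."""
    holders: dict = field(default_factory=lambda: {"memory": TOTAL_TOKENS})

    def tokens(self, node):
        return self.holders.get(node, 0)

    def can_read(self, node):
        # Token-counting rule: a reader must hold at least one token.
        return self.tokens(node) >= 1

    def can_write(self, node):
        # Token-counting rule: a writer must hold every token for the block.
        return self.tokens(node) == TOTAL_TOKENS

    def transfer(self, src, dst, count):
        # Tokens are only moved between nodes, never created or destroyed.
        if count < 1 or self.tokens(src) < count:
            raise ValueError("source does not hold enough tokens")
        self.holders[src] = self.tokens(src) - count
        self.holders[dst] = self.tokens(dst) + count
        assert sum(self.holders.values()) == TOTAL_TOKENS


def demo():
    block = Block()
    block.transfer("memory", "core0", 1)  # core0 obtains a readable copy
    block.transfer("memory", "core1", 1)  # core1 obtains a readable copy
    readers = (block.can_read("core0"), block.can_read("core1"))
    # core1 wants to write, so it must collect the tokens of every other holder.
    for src in ("memory", "core0"):
        block.transfer(src, "core1", block.tokens(src))
    return readers, block.can_write("core1"), block.can_read("core0")
```

In this sketch, a writer gathers tokens directly from whichever nodes hold them, without first consulting a home directory; that direct exchange is the kind of indirection-free behaviour the abstract contrasts with directory-based protocols. How requests are routed and how starvation is avoided are precisely the parts that LOCKE changes, and they are intentionally left out here.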

Cited By

  • (2020) Rainbow: A composable coherence protocol for multi‐chip servers. Concurrency and Computation: Practice and Experience, 32:24. DOI: 10.1002/cpe.5947. Online publication date: 21-Jul-2020
  • (2018) Mosaic. International Journal of Parallel Programming, 46:6 (1110-1138). DOI: 10.1007/s10766-018-0557-y. Online publication date: 1-Dec-2018
  • (2013) The case for a scalable coherence protocol for complex on-chip cache hierarchies in many core systems. Proceedings of the 22nd international conference on Parallel architectures and compilation techniques (279-288). DOI: 10.5555/2523721.2523760. Online publication date: 7-Oct-2013
  • (2013) Vectorization past dependent branches through speculation. Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques (353-362). DOI: 10.1109/PACT.2013.6618824. Online publication date: Oct-2013

    Published In

    CF '12: Proceedings of the 9th conference on Computing Frontiers
    May 2012
    320 pages
    ISBN:9781450312158
    DOI:10.1145/2212908

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. cmp
    2. coherence protocol
    3. memory hierarchy

    Qualifiers

    • Research-article

    Conference

    CF'12: Computing Frontiers Conference
    May 15 - 17, 2012
    Cagliari, Italy

    Acceptance Rates

    Overall Acceptance Rate 273 of 785 submissions, 35%
