research-article

Leveraging on-chip networks for data cache migration in chip multiprocessors

Authors:

Li Shang

PACT '08: Proceedings of the 17th international conference on Parallel architectures and compilation techniques

Pages 197 - 207

https://doi.org/10.1145/1454115.1454144

Published: 25 October 2008

Abstract

Recently, chip multiprocessors (CMPs) have arisen as the de facto design for modern high-performance processors, with increasing core counts. An important property of CMPs is that remote, but on-chip, L2 cache accesses are less costly than off-chip accesses; this is in contrast to earlier chip-to-chip or board-to-board multiprocessors, where an access to a remote node is just as costly as, if not more costly than, a main memory access. This motivates on-chip cache migration as a means to retain more data on-chip. However, previously proposed techniques do not scale to high core counts: they neither leverage the on-chip caches of all cores nor provide a scalable migration mechanism. In this paper we propose a scalable in-network migration technique that uses hints embedded within the router microarchitecture to steer L2 cache evictions towards free/invalid cache slots in any on-chip core cache, rather than evicting them off-chip. We show that our technique can provide an average 19% reduction in the number of off-chip memory accesses over the state-of-the-art, beating the performance of a pseudo-optimal migration technique. This can be done with negligible area overhead and a manageable traffic overhead of 13.4%.
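
The abstract describes the mechanism only at a high level, so the following is a minimal illustrative sketch and not the paper's design. It assumes a 2D mesh of tiles, one hint per router (the hop distance to the nearest tile with a free slot), and hints that are always up to date. The names `build_hints`, `neighbours`, `migrate_eviction`, and `free_slots` are hypothetical and do not come from the paper.

```python
from collections import deque

OFF_CHIP = None  # sentinel: the evicted block leaves the chip


def neighbours(tile, width, height):
    """Return the mesh neighbours of `tile` that lie inside the grid."""
    x, y = tile
    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [(a, b) for a, b in candidates if 0 <= a < width and 0 <= b < height]


def build_hints(width, height, free_slots):
    """Map each tile to its hop distance from the nearest free slot.

    Assumption: hints are computed by a multi-source breadth-first search,
    which idealises them as always fresh. Real routers would have to
    update their hints incrementally as slots fill and empty.
    """
    dist = {tile: 0 for tile, count in free_slots.items() if count > 0}
    frontier = deque(dist)
    while frontier:
        tile = frontier.popleft()
        for nxt in neighbours(tile, width, height):
            if nxt not in dist:
                dist[nxt] = dist[tile] + 1
                frontier.append(nxt)
    return dist


def migrate_eviction(src, width, height, free_slots):
    """Steer one evicted block from `src`; return (destination, path).

    The block follows the neighbour whose hint is smallest until it
    reaches a tile with a free slot. If no tile on the chip has a free
    slot, the block is evicted off-chip instead.
    """
    hints = build_hints(width, height, free_slots)
    if src not in hints:  # no free slot anywhere on the chip
        return OFF_CHIP, [src]
    tile, path = src, [src]
    while hints[tile] > 0:
        tile = min(neighbours(tile, width, height), key=lambda t: hints[t])
        path.append(tile)
    free_slots[tile] -= 1  # the migrated block now occupies the slot
    return tile, path
```

Calling `migrate_eviction((0, 0), 4, 4, {(3, 2): 1})` walks the block hop by hop to tile `(3, 2)` and consumes its only free slot; a second eviction with the same `free_slots` dictionary then finds no hint and falls back to `OFF_CHIP`. The sketch ignores traffic overhead, stale hints, and coherence, all of which a real design would have to handle.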

Index Terms

Leveraging on-chip networks for data cache migration in chip multiprocessors
1. Networks
  1. Network protocols
    1. Network protocol design
  2. Network types
    1. Packet-switching networks

Published In

PACT '08: Proceedings of the 17th international conference on Parallel architectures and compilation techniques

October 2008

328 pages

ISBN: 9781605582825

DOI: 10.1145/1454115

General Chair: Andreas Moshovos (University of Toronto, Canada)

Program Chairs: David Tarditi (Microsoft, USA), Kunle Olukotun (Stanford University, USA)

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

PACT '08

PACT '08: International Conference on Parallel Architectures and Compilation Techniques

October 25 - 29, 2008

Toronto, Ontario, Canada

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%
