Cache equalizer: a placement mechanism for chip multiprocessor distributed shared caches

Authors:

Mohammad Hammoud,

Rami G. Melhem

HiPEAC '11: Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers

Pages 177 - 186

https://doi.org/10.1145/1944862.1944889

Published: 24 January 2011

Abstract

This paper describes Cache Equalizer (CE), a novel distributed cache management scheme for large-scale chip multiprocessors (CMPs). Our work is motivated by the large asymmetry in cache set usage. CE decouples the physical locations of cache blocks from their addresses in order to reduce misses caused by destructive interference. Temporal pressure at the on-chip last-level cache is continuously collected at the granularity of a group (a collection of cache sets) and periodically recorded at the memory controller to guide the placement process. An incoming block is then placed at the cache group that exhibits the minimum pressure. Simulation results using a full-system simulator demonstrate that, for the benchmark programs we examined, CE reduces the L2 miss rate by 13.6% on average, and by as much as 46.7%, relative to a shared NUCA scheme. Furthermore, our evaluations show that CE outperforms related cache designs.
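
The placement step described in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the class name `PressureTable`, the use of a simple event count as the pressure metric, the epoch-based snapshot, and the lowest-index tie-break are all hypothetical choices made here for clarity.

```python
from dataclasses import dataclass, field


@dataclass
class PressureTable:
    """Hypothetical model of per-group pressure tracking (all names are illustrative)."""
    num_groups: int
    live: list = field(default_factory=list)      # counters updated continuously
    recorded: list = field(default_factory=list)  # snapshot held at the memory controller

    def __post_init__(self):
        self.live = [0] * self.num_groups
        self.recorded = [0] * self.num_groups

    def observe(self, group: int, amount: int = 1) -> None:
        """Accumulate pressure for a group (assumed here: one unit per event)."""
        self.live[group] += amount

    def end_epoch(self) -> None:
        """Periodically copy the live counters to the controller and start a new epoch."""
        self.recorded = list(self.live)
        self.live = [0] * self.num_groups

    def place(self) -> int:
        """Return the group with minimum recorded pressure (lowest index wins ties)."""
        return min(range(self.num_groups), key=lambda g: self.recorded[g])


# Usage: four groups; group 2 sees no pressure during the epoch.
table = PressureTable(num_groups=4)
for g in (0, 0, 0, 1, 3, 3):
    table.observe(g)
table.end_epoch()
chosen = table.place()
print(chosen)
```

Because placement reads only the recorded snapshot, decisions in this sketch lag the live counters by up to one epoch; how long that period should be is a design parameter the abstract does not specify.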

Index Terms

Cache equalizer: a placement mechanism for chip multiprocessor distributed shared caches
1. Hardware
  1. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory

Published In

HiPEAC '11: Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers

January 2011

226 pages

ISBN:9781450302418

DOI:10.1145/1944862

General Chairs:
Manolis Katevenis (FORTH-ICS and U.Crete, Greece),
Margaret Martonosi (Princeton University)

Program Chairs:
Christos Kozyrakis (Stanford University),
Olivier Temam (INRIA, France)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

HiPEAC: HiPEAC Network of Excellence

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

HIPEAC '11

Sponsor:

HiPEAC

HIPEAC '11: International Conference on High-Performance and Embedded Architectures and Compilers

January 24 - 26, 2011

Heraklion, Greece

Cited By

Das S, Kapoor H (2016). A Framework for Block Placement, Migration, and Fast Searching in Tiled-DNUCA Architecture. ACM Transactions on Design Automation of Electronic Systems 22:1 (1-26). DOI: 10.1145/2907946. Online publication date: 27-May-2016.
https://dl.acm.org/doi/10.1145/2907946

Das S, Kapoor H (2015). Exploration of Migration and Replacement Policies for Dynamic NUCA over Tiled CMPs. 2015 28th International Conference on VLSI Design (141-146). DOI: 10.1109/VLSID.2015.29. Online publication date: Jan-2015.
https://doi.org/10.1109/VLSID.2015.29

Li Y, Melhem R, Jones A (2014). A Practical Data Classification Framework for Scalable and High Performance Chip-Multiprocessors. IEEE Transactions on Computers 63:12 (2905-2918). DOI: 10.1109/TC.2013.161. Online publication date: 1-Dec-2014.
https://dl.acm.org/doi/10.1109/TC.2013.161

Li Y, Melhem R, Jones A, Yew P, Cho S, DeRose L, Lilja D (2012). Practically private. Proceedings of the 21st international conference on Parallel architectures and compilation techniques (231-240). DOI: 10.1145/2370816.2370852. Online publication date: 19-Sep-2012.
https://dl.acm.org/doi/10.1145/2370816.2370852
