PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches

Authors:
Yuejian Xie, Georgia Institute of Technology, Atlanta, GA, USA
Gabriel H. Loh, Georgia Institute of Technology, Atlanta, GA, USA

ISCA '09: Proceedings of the 36th annual international symposium on Computer architecture, June 2009, Pages 174–183, https://doi.org/10.1145/1555754.1555778

Published: 20 June 2009

ABSTRACT

Many multi-core processors employ a large last-level cache (LLC) shared among the multiple cores. Past research has demonstrated that sharing-oblivious cache management policies (e.g., LRU) can lead to poor performance and fairness when the multiple cores compete for the limited LLC capacity. Different memory access patterns can cause cache contention in different ways, and various techniques have been proposed to target some of these behaviors. In this work, we propose a new cache management approach that combines dynamic insertion and promotion policies to provide the benefits of cache partitioning, adaptive insertion, and capacity stealing all with a single mechanism. By handling multiple types of memory behaviors, our proposed technique outperforms techniques that target only capacity partitioning or only adaptive insertion.
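The abstract names the ingredients (per-core insertion and promotion policies) without giving the mechanics. As a rough illustration only, the sketch below models a single cache set as a priority list in which each core inserts new lines at a position derived from its allocation and a hit moves a line up by one step. The class name `PseudoPartitionedSet`, the `alloc` map, the single-step promotion, and the `promote_prob` parameter are all assumptions made for this example; they are not details taken from the abstract.

```python
import random


class PseudoPartitionedSet:
    """One cache set kept as a priority list; index 0 is the eviction end.

    Hypothetical model: `alloc` maps a core id to a target number of ways,
    and that number decides how far from the eviction end the core's new
    lines are inserted.
    """

    def __init__(self, ways, alloc, promote_prob=1.0, rng=None):
        self.ways = ways
        self.alloc = alloc
        self.promote_prob = promote_prob
        self.rng = rng or random.Random(0)  # seeded for repeatable runs
        self.lines = []                     # (core, tag), lowest priority first

    def access(self, core, tag):
        """Simulate one access; return True on a hit, False on a miss."""
        entry = (core, tag)
        if entry in self.lines:
            i = self.lines.index(entry)
            # Promotion: move up by a single position instead of jumping
            # straight to the highest-priority slot.
            if i + 1 < len(self.lines) and self.rng.random() < self.promote_prob:
                self.lines[i], self.lines[i + 1] = self.lines[i + 1], self.lines[i]
            return True
        if len(self.lines) == self.ways:
            self.lines.pop(0)  # evict the lowest-priority line
        # Insertion: a larger allocation places the line further from eviction.
        pos = min(self.alloc.get(core, 1) - 1, len(self.lines))
        self.lines.insert(pos, entry)
        return False
```

In this toy model, with a 4-way set and `alloc={0: 3, 1: 1}`, a run of misses from core 1 enters at the eviction end and displaces only core 1's own previous line, while core 0's lines stay resident. That is the pseudo-partitioning effect the title alludes to, obtained without hard way boundaries.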

Index Terms

Computer systems organization → Architectures → Parallel architectures

Published in
ISCA '09: Proceedings of the 36th annual international symposium on Computer architecture
June 2009
510 pages
ISBN:9781605585260
DOI:10.1145/1555754
General Chair: Steve Keckler, University of Texas at Austin
Program Chair: Luiz André Barroso, Google Inc.
ACM SIGARCH Computer Architecture News Volume 37, Issue 3
June 2009
495 pages
ISSN:0163-5964
DOI:10.1145/1555815
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
cache
contention
insertion
multi-core
promotion
sharing