ABSTRACT
Many multi-core processors employ a large last-level cache (LLC) shared among the cores. Past research has demonstrated that sharing-oblivious cache management policies (e.g., LRU) can lead to poor performance and fairness when multiple cores compete for the limited LLC capacity. Different memory access patterns cause cache contention in different ways, and various techniques have been proposed to target some of these behaviors. In this work, we propose a new cache management approach that combines dynamic insertion and promotion policies to provide the benefits of cache partitioning, adaptive insertion, and capacity stealing, all with a single mechanism. By handling multiple types of memory behavior, our proposed technique outperforms techniques that target only capacity partitioning or only adaptive insertion.
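The abstract does not spell out the mechanism, so the following is only a minimal sketch of how insertion and promotion policies could together approximate a partition in one cache set. It assumes a per-core insertion position and single-step promotion on a hit. The class name `PseudoPartitionedSet`, the `insert_pos` mapping, and the `promote_prob` parameter are hypothetical names chosen for illustration, not taken from the paper.

```python
import random


class PseudoPartitionedSet:
    """One cache set kept as a priority list; index 0 is the next victim.

    Illustrative assumptions (not from the abstract):
      insert_pos   -- core id -> position, counted from the victim end,
                      at which that core's new lines are installed
      promote_prob -- probability that a hit moves a line up one position
    """

    def __init__(self, ways, insert_pos, promote_prob=0.75, rng=None):
        self.ways = ways
        self.insert_pos = insert_pos
        self.promote_prob = promote_prob
        self.rng = rng or random.Random(0)
        self.lines = []  # (core, tag) pairs, lowest priority first

    def access(self, core, tag):
        """Return True on a hit and False on a miss."""
        entry = (core, tag)
        if entry in self.lines:
            i = self.lines.index(entry)
            # Promotion: move up a single position rather than to the top.
            if i + 1 < len(self.lines) and self.rng.random() < self.promote_prob:
                self.lines[i], self.lines[i + 1] = self.lines[i + 1], self.lines[i]
            return True
        if len(self.lines) == self.ways:
            self.lines.pop(0)  # evict the lowest-priority line
        # Insertion: a core with a low position competes only for the
        # victim end of the list; a core with a high position is protected.
        pos = min(self.insert_pos[core], len(self.lines))
        self.lines.insert(pos, entry)
        return False


# Core 0 reuses two lines; core 1 streams through lines it never reuses.
cache_set = PseudoPartitionedSet(ways=4, insert_pos={0: 3, 1: 0}, promote_prob=1.0)
cache_set.access(0, "a")
cache_set.access(0, "b")
for n in range(10):
    cache_set.access(1, f"x{n}")
hits = [cache_set.access(0, "a"), cache_set.access(0, "b")]
```

In this sketch the streaming core inserts at the victim end, so its lines mostly replace one another and the reused lines of core 0 survive. Nothing enforces a hard way quota, so a core can still occupy positions another core leaves unused, which is one way to read the "capacity stealing" benefit named in the abstract.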
Index Terms
- PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches