ABSTRACT
Cache blocks often exhibit a small number of uses during their lifetime in the last-level cache. Past research has exploited this property in two different ways. First, replacement policies have been designed to evict dead blocks early and retain the potentially live blocks. Second, dynamic insertion policies attempt to victimize single-use blocks (dead on fill) as early as possible, thereby leaving most of the working set undisturbed in the cache. However, we observe that as the last-level cache grows in capacity and associativity, the traditional dead block prediction-based replacement policy loses effectiveness because often the LRU block itself is dead, leading to an LRU replacement decision. The benefit of dynamic insertion policies is also small for a large class of applications in which a significant number of cache blocks see a small number of uses, yet more than one.
To address these drawbacks, we introduce pseudo-last-in-first-out (pseudo-LIFO), a fundamentally new family of replacement heuristics that manages each cache set as a fill stack (as opposed to the traditional access recency stack). We specify three members of this family, namely, dead block prediction LIFO, probabilistic escape LIFO, and probabilistic counter LIFO. The probabilistic escape LIFO (peLIFO) policy is the central contribution of this paper. It dynamically learns the use probabilities of cache blocks beyond each fill stack position to implement a new replacement policy. Our detailed simulation results show that peLIFO, while incurring less than 1% storage overhead, reduces execution time by 10% on average compared to a baseline LRU replacement policy for a set of fourteen single-threaded applications on a 2 MB 16-way set associative L2 cache. It reduces CPI by 19% on average for a set of twelve multiprogrammed workloads while satisfying a strong fairness requirement on a four-core chip-multiprocessor with an 8 MB 16-way set associative shared L2 cache. Further, it reduces parallel execution time by 17% on average for a set of six multi-threaded programs on an eight-core chip-multiprocessor with a 4 MB 16-way set associative shared L2 cache. For the architectures considered in this paper, the storage overhead of the peLIFO policy is one-fifth to half of that of a state-of-the-art dead block prediction-based replacement policy. However, the peLIFO policy delivers better average performance for the selected single-threaded and multiprogrammed workloads, and similar average performance for the multi-threaded workloads, compared to the dead block prediction-based replacement policy.
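The fill-stack idea described above can be illustrated with a minimal, hypothetical sketch of a single cache set. This is not the paper's exact algorithm: the class name, the per-position hit counters, and the escape-point threshold heuristic are all illustrative assumptions. It only shows the two defining behaviors of the pseudo-LIFO family: fill-stack order changes on fills (not on hits), and the victim is chosen near the top of the fill stack once a block's position is unlikely to see further use.

```python
class FillStackSet:
    """A hypothetical sketch of one set managed as a fill stack,
    in the spirit of probabilistic escape LIFO (peLIFO)."""

    def __init__(self, ways):
        self.ways = ways
        self.stack = []                 # index 0 = oldest fill, -1 = top
        self.hits_beyond = [0] * ways   # observed hits per fill position

    def access(self, tag):
        """Return True on hit, False on miss (which triggers a fill)."""
        if tag in self.stack:
            pos = self.stack.index(tag)
            self.hits_beyond[pos] += 1  # learn use probability per position
            return True                 # hits do NOT reorder the fill stack
        self._fill(tag)
        return False

    def _fill(self, tag):
        if len(self.stack) == self.ways:
            self.stack.pop(self._choose_victim())
        self.stack.append(tag)          # new fills land on top of the stack

    def _choose_victim(self):
        # Assumed heuristic: positions that rarely see hits are past the
        # "escape point"; evict the topmost such block. With no history,
        # this degenerates to pure LIFO (evict the most recent fill).
        threshold = max(self.hits_beyond) // 4
        for pos in range(len(self.stack) - 1, -1, -1):
            if self.hits_beyond[pos] <= threshold:
                return pos
        return len(self.stack) - 1
```

Note how this differs from LRU: after the set fills, a burst of new single-use blocks repeatedly replaces only the top of the fill stack, leaving the bottom of the stack (the resident working set) undisturbed.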
Index Terms
- Pseudo-LIFO: the foundation of a new family of replacement policies for last-level caches