DOI: 10.1145/1542275.1542290
Research article

Less reused filter: improving L2 cache performance via filtering less reused lines

Published: 08 June 2009

Abstract

The L2 cache is commonly managed using the LRU policy. For workloads whose working set is larger than the L2 cache, LRU behaves poorly, producing a large number of less reused lines, that is, lines that are never reused or reused only a few times. In this case, cache performance can be improved by retaining a portion of the working set in the cache for a sufficiently long period. Previous schemes approach this by bypassing never reused lines. However, because they are severely constrained by the number of never reused lines, they sometimes deliver no benefit at all.
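The thrashing behavior described above can be reproduced with a toy simulation (a hypothetical illustration, not code from the paper): a cyclic working set one line larger than an LRU cache misses on every access, while pinning a subset of the working set and bypassing the rest turns most accesses into hits.

```python
# Hypothetical illustration of the motivation: LRU thrashes on a cyclic
# working set of 5 lines with a 4-line cache, while retaining a fixed
# portion of the working set (and bypassing the rest) hits after warm-up.
from collections import OrderedDict

def lru_misses(trace, capacity):
    cache, misses = OrderedDict(), 0
    for line in trace:
        if line in cache:
            cache.move_to_end(line)        # hit: mark most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[line] = True
    return misses

def pinned_misses(trace, capacity, pinned):
    # Retain a fixed subset of the working set; bypass everything else.
    cache, misses = set(), 0
    for line in trace:
        if line in cache:
            continue
        misses += 1
        if line in pinned and len(cache) < capacity:
            cache.add(line)
    return misses

trace = list(range(5)) * 100               # cyclic working set of 5 lines
print(lru_misses(trace, 4))                # 500: LRU misses on every access
print(pinned_misses(trace, 4, {0, 1, 2})) # 203: pinned lines hit after first touch
```

With 500 accesses, LRU misses every time, while pinning three of the five lines cuts misses to 203, which is the gap the paper's filtering approach tries to exploit.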
This paper proposes a new filtering mechanism that filters out less reused lines rather than just never reused lines. This extended scope of bypassing provides more opportunities to fit the working set into the cache. The paper also proposes the Less Reused Filter (LRF), a separate structure placed before the L2 cache, to implement this mechanism. The LRF employs a reuse frequency predictor to accurately identify less reused lines among incoming lines. Meanwhile, based on our observation that most less reused lines have a short life span, the LRF places the filtered lines into a small filter buffer so they can still be fully utilized, avoiding extra misses.
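The mechanism can be sketched functionally as follows. This is an assumed minimal model, not the paper's design: the predictor here is a tag-indexed table of saturating counters trained on each line's observed reuse count at eviction time, and the threshold is a made-up parameter; the real LRF's predictor organization differs.

```python
# Minimal functional sketch of the LRF idea (assumed details, not the
# paper's implementation): on a miss, a reuse frequency predictor decides
# whether the incoming line goes into L2 or into a small filter buffer.
from collections import OrderedDict

class LessReusedFilter:
    def __init__(self, l2_lines, buf_lines, threshold=1):
        self.l2 = OrderedDict()    # tag -> reuse count, in LRU order
        self.buf = OrderedDict()   # small filter buffer for filtered lines
        self.pred = {}             # tag -> saturating reuse counter
        self.l2_lines, self.buf_lines = l2_lines, buf_lines
        self.threshold = threshold

    def _evict(self, store, limit):
        while len(store) > limit:
            tag, reuses = store.popitem(last=False)
            # Train the predictor with the reuse count seen this lifetime.
            self.pred[tag] = min(reuses, 3)

    def access(self, tag):
        for store in (self.l2, self.buf):
            if tag in store:
                store[tag] += 1
                store.move_to_end(tag)
                return True        # hit in L2 or in the filter buffer
        # Miss: lines predicted less-reused go to the filter buffer,
        # keeping them out of L2 so the retained working set survives.
        if self.pred.get(tag, 0) < self.threshold:
            target, limit = self.buf, self.buf_lines
        else:
            target, limit = self.l2, self.l2_lines
        target[tag] = 0
        self._evict(target, limit)
        return False
```

Because short-lived lines are still held briefly in the filter buffer, near-term re-references to them hit there instead of becoming the extra misses a pure bypass scheme would incur.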
Our evaluation of 24 SPEC 2000 benchmarks shows that augmenting a 512KB LRU-managed L2 cache with an LRF containing a 32KB filter buffer reduces the average MPKI by 27.5%, narrowing the gap between LRU and OPT by 74.4%.




    Published In

    ICS '09: Proceedings of the 23rd international conference on Supercomputing
    June 2009
    544 pages
    ISBN:9781605584980
    DOI:10.1145/1542275

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. cache filtering
    2. less reused line

    Qualifiers

    • Research-article

    Conference

ICS '09: International Conference on Supercomputing
June 8-12, 2009
Yorktown Heights, NY, USA

    Acceptance Rates

Overall acceptance rate: 629 of 2,180 submissions, 29%


    Cited By

    • (2024) GMT: GPU Orchestrated Memory Tiering for the Big Data Era. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 464-478. DOI: 10.1145/3620666.3651353
    • (2024) Skyway: Accelerate Graph Applications with a Dual-Path Architecture and Fine-Grained Data Management. Journal of Computer Science and Technology, 39(4), 871-894. DOI: 10.1007/s11390-023-2939-x
    • (2023) ACIC: Admission-Controlled Instruction Cache. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 165-178. DOI: 10.1109/HPCA56546.2023.10071033
    • (2021) Dead Page and Dead Block Predictors: Cleaning TLBs and Caches Together. 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 507-519. DOI: 10.1109/HPCA51647.2021.00050
    • (2019) Reducing Data Movement and Energy in Multilevel Cache Hierarchies without Losing Performance: Can You Have It All? Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 382-393. DOI: 10.1109/PACT.2019.00037
    • (2017) A DVFS-aware cache bypassing technique for multiple clock domain mobile SoCs. IEICE Electronics Express, 14(11). DOI: 10.1587/elex.14.20170324
    • (2016) A Survey of Cache Bypassing Techniques. Journal of Low Power Electronics and Applications, 6(2), 5. DOI: 10.3390/jlpea6020005
    • (2016) The IBP Replacement Algorithm Based on Process Binding. Software Engineering and Applications, 5(3), 181-189. DOI: 10.12677/SEA.2016.53020
    • (2015) Applying SVM to data bypass prediction in multi core last-level caches. IEICE Electronics Express, 12(22). DOI: 10.1587/elex.12.20150736
    • (2014) Retention Benefit Based Intelligent Cache Replacement. Journal of Computer Science and Technology, 29(6), 947-961. DOI: 10.1007/s11390-014-1481-2
