ABSTRACT
One of the most basic algorithmic problems concerning caches is to compute the LRU hit-rate curve on a given trace. Unfortunately, the known algorithms exhibit poor data locality and fail to scale to large caches. It is widely believed that the LRU hit-rate curve cannot be computed efficiently enough to be used in online production settings. This has led to a large literature on heuristics that aim to approximate the curve efficiently.
In this paper, we show that the poor data locality of past algorithms can be avoided. We introduce a new algorithm, called Increment-and-Freeze, for computing exact LRU hit-rate curves. The algorithm achieves RAM-model complexity O(n log n), external-memory complexity O((n/B) log n), and parallelism Θ(log n). We also present two theoretical extensions of Increment-and-Freeze: one that achieves SORT complexity in the external-memory model, and one that achieves a parallel span of O(log^2 n), which is near-linear parallelism, while maintaining work efficiency.
We implement Increment-and-Freeze and obtain a speedup of up to 9x over the classical augmented-tree algorithm on a single processor. On 16 threads, the speedup becomes as large as 60x. In comparison to the previous state-of-the-art parallel algorithm, Increment-and-Freeze achieves a speedup of up to 10x when both algorithms use the same number of threads.
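To make the problem statement concrete, the following is a minimal sketch of what "computing the LRU hit-rate curve on a trace" means. It is not Increment-and-Freeze and not the augmented-tree algorithm; it is a deliberately naive list-based stack-distance simulation that takes quadratic time in the worst case. The function name `lru_hit_rate_curve` and the representation of the trace as a Python list are assumptions made for illustration only.

```python
def lru_hit_rate_curve(trace):
    """Return a list whose entry k-1 is the LRU hit rate for a cache of size k.

    Hypothetical helper for illustration: a naive stack-distance simulation,
    not the algorithm introduced in the paper.
    """
    stack = []            # LRU stack, most recently used item at index 0
    distance_counts = {}  # stack distance -> number of requests with that distance

    for item in trace:
        if item in stack:
            # Stack distance (1-based): the smallest cache size at which
            # this request would be a hit under LRU.
            distance = stack.index(item) + 1
            distance_counts[distance] = distance_counts.get(distance, 0) + 1
            stack.remove(item)
        # Move (or insert) the requested item to the top of the stack.
        stack.insert(0, item)

    # A cache of size k hits on every request whose stack distance is <= k,
    # so the curve is a prefix sum over the distance histogram.
    curve = []
    hits = 0
    for size in range(1, len(stack) + 1):
        hits += distance_counts.get(size, 0)
        curve.append(hits / len(trace))
    return curve


if __name__ == "__main__":
    # Trace a, b, c, a, b: both repeated requests have stack distance 3.
    print(lru_hit_rate_curve(list("abcab")))
```

Every request here scans and shifts a linear-size list, which is one simple way to see why naive simulation does not scale to long traces or large caches.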
Index Terms
- Increment-and-Freeze: Every Cache, Everywhere, All of the Time