Abstract
As memory access times grow larger relative to processor cycle times, the cache performance of algorithms has an increasingly large impact on overall performance. Unfortunately, most commonly used algorithms were not designed with cache performance in mind. This paper investigates the cache performance of implicit heaps. We present optimizations that significantly reduce the cache misses heaps incur and improve their overall performance. We present an analytical model called collective analysis that allows cache performance to be predicted as a function of both cache configuration and algorithm configuration. As part of our investigation, we perform an approximate analysis of the cache performance of both traditional heaps and our improved heaps in our model. In addition, empirical data is given for five architectures to show the impact our optimizations have on overall performance. We also revisit a priority queue study originally performed by Jones [25]. Due to the increases in cache miss penalties, the relative performance results we obtain on today's machines differ greatly from those obtained on the machines of only ten years ago. We compare the performance of implicit heaps, skew heaps, and splay trees and discuss the difference between our results and Jones's.
Supplemental Material
The software suite accompanying the article.
- {1} A. Agarwal, M. Horowitz, and J. Hennessy. An analytical cache model. ACM Transactions on Computer Systems, 7:2:184-215, 1989.
- {2} R. Agarwal, F. Gustavson, and M. Zubair. Exploiting functional parallelism of POWER2 to design high-performance numerical algorithms. IBM Journal of Research and Development, 38:5:563-576, Sep 1994.
- {3} A. Aggarwal, K. Chandra, and M. Snir. A model for hierarchical memory. In 19th Annual ACM Symposium on Theory of Computing, pages 305-314, 1987.
- {4} A. Aho, J. Hopcroft, and J. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, Massachusetts, 1974.
- {5} B. Alpern, L. Carter, E. Feig, and T. Selker. The uniform memory hierarchy model of computation. Algorithmica, 12:2-3:72-109, 1994.
- {6} J. Anderson and M. Lam. Global optimizations for parallelism and locality on scalable parallel machines. In Proceedings of the 1993 ACM Symposium on Programming Languages Design and Implementation, pages 112-125. ACM, 1993.
- {7} S. Carlsson. An optimal algorithm for deleting the root of a heap. Information Processing Letters, 37:2:117-120, 1991.
- {8} S. Carr, K. McKinley, and C. W. Tseng. Compiler optimizations for improving data locality. In Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 252-262, 1994.
- {9} M. Cierniak and Wei Li. Unifying data and control transformations for distributed shared-memory machines. In Proceedings of the 1995 ACM Symposium on Programming Languages Design and Implementation, pages 205-217. ACM, 1995.
- {10} D. Clark. Cache performance of the VAX-11/780. ACM Transactions on Computer Systems, 1:1:24-37, 1983.
- {11} E. Coffman and P. Denning. Operating Systems Theory. Prentice-Hall, Englewood Cliffs, NJ, 1973.
- {12} T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. The MIT Press, Cambridge, MA, 1990.
- {13} J. de Graaf and W. Kosters. Expected heights in heaps. BIT, 32:4:570-579, 1992.
- {14} E. Doberkat. Inserting a new element into a heap. BIT, 21:225-269, 1981.
- {15} E. Doberkat. Deleting the root of a heap. Acta Informatica, 17:245-265, 1982.
- {16} J. Dongarra, O. Brewer, J. Kohl, and S. Fineberg. A tool to aid in the design, implementation, and understanding of matrix algorithms for parallel processors. Journal of Parallel and Distributed Computing, 9:2:185-202, June 1990.
- {17} M. Farrens, G. Tyson, and A. Pleszkun. A study of single-chip processor/cache organizations for large numbers of transistors. In Proceedings of the 21st Annual International Symposium on Computer Architecture, pages 338-347, 1994.
- {18} D. Fenwick, D. Foley, W. Gist, S. VanDoren, and D. Wissell. The AlphaServer 8000 series: High-end server platform development. Digital Technical Journal, 7:1:43-65, 1995.
- {19} Robert W. Floyd. Treesort 3. Communications of the ACM, 7:12:701, 1964.
- {20} D. Gannon, W. Jalby, and K. Gallivan. Strategies for cache and local memory management by global program transformation. Journal of Parallel and Distributed Computing, 5:5:587-616, Oct 1988.
- {21} G. Gonnet and J. Munro. Heaps on heaps. SIAM Journal of Computing, 15:4:964-971, 1986.
- {22} D. Grunwald, B. Zorn, and R. Henderson. Improving the cache locality of memory allocation. In Proceedings of the 1993 ACM Symposium on Programming Languages Design and Implementation, pages 177-186. ACM, 1993.
- {23} J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, Inc., San Mateo, CA, 1990.
- {24} D.B. Johnson. Priority queues with update and finding minimum spanning trees. Information Processing Letters, 4, 1975.
- {25} D. Jones. An empirical comparison of priority-queue and event-set implementations. Communications of the ACM, 29:4:300-311, 1986.
- {26} K. Kennedy and K. McKinley. Optimizing for parallelism and data locality. In Proceedings of the 1992 International Conference on Supercomputing, pages 323-334, 1992.
- {27} D.E. Knuth. The Art of Computer Programming, vol III-Sorting and Searching. Addison-Wesley, Reading, MA, 1973.
- {28} A. LaMarca. Caches and algorithms. Ph.D. Dissertation, University of Washington, May 1996.
- {29} A. LaMarca and R.E. Ladner. The influence of caches on the performance of sorting. Technical Report 96-10-01, University of Washington, Department of Computer Science and Engineering, 1992. Also appears in the Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, January 1997.
- {30} A. Lebeck and D. Wood. Cache profiling and the spec benchmarks: a case study. Computer, 27:10:15-26, Oct 1994.
- {31} M. Martonosi, A. Gupta, and T. Anderson. Memspy: analyzing memory system bottlenecks in programs. In Proceedings of the 1992 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 1-12, 1992.
- {32} D. Naor, C. Martel, and N. Matloff. Performance of priority queue structures in a virtual memory environment. Computer Journal, 34:5:428-437, Oct 1991.
- {33} G. Rao. Performance analysis of cache memories. Journal of the ACM, 25:3:378-395, 1978.
- {34} R. Sedgewick. Algorithms. Addison-Wesley, Reading, MA, 1988.
- {35} J.P. Singh, H.S. Stone, and D.F. Thiebaut. A model of workloads and its use in miss-rate prediction for fully associative caches. IEEE Transactions on Computers, 41:7:811-825, 1992.
- {36} D. Sleator and R. Tarjan. Self-adjusting binary search trees. Journal of the ACM, 32:3:652-686, 1985.
- {37} Amitabh Srivastava and Alan Eustace. ATOM: A system for building customized program analysis tools. In Proceedings of the 1994 ACM Symposium on Programming Languages Design and Implementation, pages 196-205. ACM, 1994.
- {38} O. Temam, C. Fricker, and W. Jalby. Cache interference phenomena. In Proceedings of the 1994 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 261-271, 1994.
- {39} R. Uhlig, D. Nagle, T. Stanley, T. Mudge, S. Sechrest, and R. Brown. Design tradeoffs for software-managed TLBs. ACM Transactions on Computer Systems, 12:3:175-205, 1994.
- {40} M. Weiss. Data structures and algorithm analysis. Benjamin/Cummings Pub. Co., Redwood City, CA, 1995.
- {41} H. Wen and J. L. Baer. Efficient trace-driven simulation methods for cache performance analysis. ACM Transactions on Computer Systems, 9:3:222-241, 1991.
- {42} J. W. Williams. Heapsort. Communications of the ACM, 7:6:347-348, 1964.
- {43} M. Wolf and M. Lam. A data locality optimizing algorithm. In Proceedings of the 1991 ACM Symposium on Programming Languages Design and Implementation, pages 30-44. ACM, 1991.