DOI: 10.1145/1062261.1062304
Article

A case for a working-set-based memory hierarchy

Published: 04 May 2005

Abstract

Modern microprocessor designs continue to obtain impressive performance gains through increasing clock rates and advances in the parallelism obtained via micro-architecture design. Unfortunately, corresponding improvements in memory design technology have not been realized, resulting in latencies of over 100 cycles between processors and main memory. This ever-increasing gap in speed has pushed the current memory-hierarchy approach to its limit.

Traditional approaches to memory-hierarchy management have not yielded satisfactory results. Hardware solutions require more power and energy than desired and do not scale well. Compiler solutions tend to miss too many optimization opportunities because of limited compile-time knowledge of run-time behavior. This paper explores a different approach that combines the two, using the static knowledge obtained by the compiler in the dynamic decision making of the micro-architecture. We propose a memory-hierarchy design based on working sets that uses compile-time annotations regarding the working set of memory operations to guide cache placement decisions.

Our experiments show that a working-set-based memory hierarchy can significantly reduce the miss rate for memory-intensive tiled kernels by limiting cross interference. The working-set-based memory hierarchy allows the compiler to tile many loops without concern for cross interference in the cache, making tile size choice easier. In addition, the compiler can more easily tailor tile choices to the separate needs of different working sets.
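The cross-interference effect described in the abstract can be illustrated with a toy cache model. The sketch below is not the paper's design: it assumes a direct-mapped cache, a hypothetical per-access `working_set` tag standing in for the compile-time annotation, and a simple static split of the cache into equal partitions, one per working set. The names `DirectMappedCache` and `run_tiled_kernel`, the 64-byte line size, and the array layout are all illustrative assumptions.

```python
LINE_SIZE = 64  # bytes per cache line (assumed for illustration)


class DirectMappedCache:
    """Toy direct-mapped cache, optionally split into equal partitions."""

    def __init__(self, num_lines, num_partitions=1):
        if num_lines % num_partitions != 0:
            raise ValueError("num_lines must be divisible by num_partitions")
        self.num_lines = num_lines
        self.num_partitions = num_partitions
        self.lines = [None] * num_lines  # block number held by each line
        self.accesses = 0
        self.misses = 0

    def access(self, addr, working_set=0):
        """Access one byte address; return True on a hit.

        working_set is a hypothetical annotation attached to the memory
        operation. It selects the partition the block may be placed in.
        """
        block = addr // LINE_SIZE
        part_size = self.num_lines // self.num_partitions
        part = working_set % self.num_partitions
        index = part * part_size + block % part_size
        self.accesses += 1
        hit = self.lines[index] == block
        if not hit:
            self.misses += 1
            self.lines[index] = block  # evict whatever was there
        return hit


def run_tiled_kernel(cache, reps=4, tile_elems=128, elem_size=8):
    """Sweep one tile of two arrays repeatedly; return the miss rate.

    Array B starts exactly one cache size after array A, so in an
    unpartitioned cache A[i] and B[i] compete for the same line.
    """
    a_base = 0
    b_base = cache.num_lines * LINE_SIZE
    for _ in range(reps):
        for i in range(tile_elems):
            cache.access(a_base + i * elem_size, working_set=0)
            cache.access(b_base + i * elem_size, working_set=1)
    return cache.misses / cache.accesses


if __name__ == "__main__":
    conventional = run_tiled_kernel(DirectMappedCache(64))
    partitioned = run_tiled_kernel(DirectMappedCache(64, num_partitions=2))
    print(f"conventional miss rate: {conventional:.3f}")
    print(f"partitioned miss rate:  {partitioned:.3f}")
```

In this contrived layout the two tiles evict each other on every access in the unpartitioned cache, while the partitioned cache sees only the cold misses for each tile. Real workloads would fall somewhere in between, and the halved capacity per working set is a cost this toy model does not explore.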

Published In

CF '05: Proceedings of the 2nd conference on Computing frontiers
May 2005
467 pages
ISBN:1595930191
DOI:10.1145/1062261

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cache design
  2. loop tiling

Qualifiers

  • Article

Conference

CF05: Computing Frontiers Conference
May 4 - 6, 2005
Ischia, Italy

Acceptance Rates

Overall Acceptance Rate 273 of 785 submissions, 35%

Article Metrics

  • Total Citations: 0
  • Total Downloads: 159
  • Downloads (Last 12 months): 0
  • Downloads (Last 6 weeks): 0

Reflects downloads up to 07 Mar 2025