A case for a working-set-based memory hierarchy

Authors:

Soner Önder

CF '05: Proceedings of the 2nd conference on Computing frontiers

Pages 252 - 261

https://doi.org/10.1145/1062261.1062304

Published: 04 May 2005

Abstract

Modern microprocessor designs continue to obtain impressive performance gains through increasing clock rates and advances in the parallelism obtained via micro-architecture design. Unfortunately, corresponding improvements in memory design technology have not been realized, resulting in latencies of over 100 cycles between processors and main memory. This ever-increasing gap in speed has pushed the current memory-hierarchy approach to its limit.

Traditional approaches to memory-hierarchy management have not yielded satisfactory results. Hardware solutions require more power and energy than desired and do not scale well. Compiler solutions tend to miss too many optimization opportunities because of limited compile-time knowledge of run-time behavior. This paper explores a different approach that combines both approaches by making use of the static knowledge obtained by the compiler in the dynamic decision making of the micro-architecture. We propose a memory-hierarchy design based on working sets that uses compile-time annotations regarding the working set of memory operations to guide cache placement decisions.

Our experiments show that a working-set-based memory hierarchy can significantly reduce the miss rate for memory-intensive tiled kernels by limiting cross interference. The working-set-based memory hierarchy allows the compiler to tile many loops without concern for cross interference in the cache, making tile size choice easier. In addition, the compiler can more easily tailor tile choices to the separate needs of different working sets.
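To make the cross-interference claim concrete, the following toy simulation compares a shared direct-mapped cache with one split into per-working-set regions. It is only a sketch and does not reproduce the paper's hardware design or experimental setup. The function names `count_misses` and `tiled_trace`, the trace format, the cache size, and the rule that a working-set id selects a cache region are all assumptions made for illustration.

```python
def count_misses(trace, num_lines, partitions=1):
    """Count misses in a toy direct-mapped cache.

    trace      -- list of (working_set_id, block_address) pairs (assumed format)
    num_lines  -- total number of cache lines
    partitions -- number of equal regions; a working-set id picks its region
                  (hypothetical placement rule, not the paper's mechanism)
    """
    region = num_lines // partitions
    lines = [None] * num_lines          # each entry holds the cached block address
    misses = 0
    for ws_id, block in trace:
        base = (ws_id % partitions) * region
        index = base + block % region   # placement is confined to the region
        if lines[index] != block:
            misses += 1
            lines[index] = block
    return misses


def tiled_trace(passes=10, tile_blocks=4):
    """Interleave accesses to two tiles whose blocks collide in an 8-line cache."""
    a_blocks = range(0, tile_blocks)        # tile of array A: blocks 0..3
    b_blocks = range(8, 8 + tile_blocks)    # tile of array B: blocks 8..11
    trace = []
    for _ in range(passes):
        for a, b in zip(a_blocks, b_blocks):
            trace.append((0, a))            # working set 0
            trace.append((1, b))            # working set 1
    return trace


trace = tiled_trace()
shared = count_misses(trace, num_lines=8, partitions=1)
partitioned = count_misses(trace, num_lines=8, partitions=2)
print(shared, partitioned)  # prints "80 8"
```

In this contrived trace the two tiles evict each other on every access in the shared cache, while the partitioned cache takes only the eight cold misses. Real workloads and real hardware would behave less starkly.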

Index Terms

A case for a working-set-based memory hierarchy
1. Hardware
  1. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Source code generation

Published In

CF '05: Proceedings of the 2nd conference on Computing frontiers

May 2005

467 pages

ISBN: 1595930191

DOI: 10.1145/1062261

General Chair: Nader Bagherzadeh

Program Chairs: Mateo Valero, Alex Ramirez

Copyright © 2005 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

CF05

Sponsor:

CF05: Computing Frontiers Conference

May 4 - 6, 2005

Ischia, Italy

Acceptance Rates

Overall Acceptance Rate 273 of 785 submissions, 35%
