Abstract
Cache misses form a major bottleneck for memory-intensive applications, due to the significant latency of main memory accesses. Loop tiling, in conjunction with other program transformations, has been shown to be an effective approach to improving locality and cache exploitation, especially for dense matrix scientific computations. Beyond loop nest optimizations, data transformation techniques, and in particular blocked data layouts, have been used to boost cache performance. The stability of the performance improvements achieved is heavily dependent on the appropriate selection of tile sizes.
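To make the two ideas concrete, the following is a minimal sketch, not code from the paper, of a tiled matrix multiplication that operates on a blocked array layout. It assumes square n x n matrices, a tile edge t that divides n, tiles stored one after another in row-major order, and row-major storage inside each tile. The names blocked_index, to_blocked and tiled_matmul_blocked are hypothetical and chosen only for illustration.

```python
def blocked_index(i, j, n, t):
    """Flat offset of element (i, j) in an n x n matrix stored tile by tile.

    Assumption (for illustration only): t divides n, tiles are laid out in
    row-major order, and elements inside a tile are row-major as well.
    """
    tiles_per_row = n // t
    tile_row, in_row = divmod(i, t)
    tile_col, in_col = divmod(j, t)
    tile_id = tile_row * tiles_per_row + tile_col
    return tile_id * t * t + in_row * t + in_col


def to_blocked(matrix, n, t):
    """Copy a list-of-lists matrix into a flat list using the blocked layout."""
    flat = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            flat[blocked_index(i, j, n, t)] = matrix[i][j]
    return flat


def tiled_matmul_blocked(a, b, n, t):
    """Compute C = A * B, with A, B and C all held in the blocked layout.

    The three outer loops walk over tiles; the three inner loops stay inside
    one tile of each array, so every tile occupies contiguous memory while
    it is being reused.
    """
    c = [0.0] * (n * n)
    for ii in range(0, n, t):
        for jj in range(0, n, t):
            for kk in range(0, n, t):
                for i in range(ii, ii + t):
                    for k in range(kk, kk + t):
                        a_ik = a[blocked_index(i, k, n, t)]
                        for j in range(jj, jj + t):
                            c[blocked_index(i, j, n, t)] += (
                                a_ik * b[blocked_index(k, j, n, t)]
                            )
    return c
```

In a compiled language the index computation would normally be hoisted out of the inner loops or simplified; it is spelled out here on every access only to keep the layout explicit.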
In this paper, we investigate the memory performance of blocked data layouts and provide a theoretical analysis for the multiple levels of the memory hierarchy when they are organized in a set-associative fashion. According to this analysis, the tile size that maximizes L1 cache utilization should completely fit in the L1 cache, even for loop bodies that access more than one array. Increased self- and/or cross-interference misses can be tolerated through prefetching. Such larger tiles also reduce mispredicted branches and, as a result, the CPU cycles lost to them. Results are validated through actual benchmarks on an SMT platform.
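As a rough illustration of the "one tile fits in L1" guideline, the sketch below computes the largest square tile edge whose footprint does not exceed a given cache capacity. It is a back-of-the-envelope calculation under stated assumptions, not the paper's analysis: it ignores associativity, line size and interference between arrays. The function name max_tile_edge, the cache sizes in the usage lines, and the optional rounding down to a power of two are all assumptions introduced here.

```python
import math


def max_tile_edge(cache_bytes, elem_bytes, power_of_two=False):
    """Largest T such that a T x T tile of elem_bytes-sized elements fits
    in cache_bytes.

    Assumption (illustrative only): capacity is the sole constraint;
    set associativity and cross-array conflicts are not modeled.
    """
    edge = math.isqrt(cache_bytes // elem_bytes)
    if power_of_two and edge > 0:
        # Round down to a power of two, a convenience assumed here because
        # such tile edges make blocked-layout index arithmetic simpler.
        edge = 1 << (edge.bit_length() - 1)
    return edge


# Hypothetical cache sizes with 8-byte (double precision) elements.
print(max_tile_edge(16 * 1024, 8))                     # 45
print(max_tile_edge(16 * 1024, 8, power_of_two=True))  # 32
print(max_tile_edge(32 * 1024, 8))                     # 64
```

For example, a 16 KiB cache holds 2048 doubles, so a 45 x 45 tile (2025 elements) fits while a 46 x 46 tile (2116 elements) does not.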
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Athanasaki, E., Kourtis, K., Anastopoulos, N., Koziris, N. (2005). Tuning Blocked Array Layouts to Exploit Memory Hierarchy in SMT Architectures. In: Bozanis, P., Houstis, E.N. (eds) Advances in Informatics. PCI 2005. Lecture Notes in Computer Science, vol 3746. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11573036_57
DOI: https://doi.org/10.1007/11573036_57
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29673-7
Online ISBN: 978-3-540-32091-3
eBook Packages: Computer Science (R0)