Abstract
Optimizations, including tiling, often target a single level of memory or parallelism, such as cache. These optimizations usually operate on a level-by-level basis, guided by a cost function parameterized by features of that single level. The benefit of optimizations guided by these one-level cost functions decreases as architectures tend towards a hierarchy of memory and of parallelism. We have identified three common architectural scenarios where a single tiling choice could be improved by using information from multiple levels in concert. For the first two scenarios, we derive multi-level cost functions which guide the optimal choice of tile size and shape, and quantify the improvement gained. We give both analysis and simulation results to support our points. For the third scenario, we summarize our findings.
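The idea that one tiling choice can be improved by consulting several levels at once can be made concrete with a toy model. The sketch below is a hypothetical illustration, not the cost functions derived in the paper. It assumes square b × b tiles over an n × n iteration space, a cache level that only requires a tile to fit in `cache_words` words, and a parallel level that spreads tiles evenly over `procs` processors. All function names and parameters are invented for the example. A cache-only choice takes the largest tile that fits. The two-level choice searches the cache-feasible sizes for the one that minimizes the busiest processor's work.

```python
import math


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive integers."""
    return (a + b - 1) // b


def cache_only_tile(cache_words: int) -> int:
    """One-level choice (toy assumption): the largest square tile b
    with b * b <= cache_words, i.e. one tile of one array fits in cache."""
    return math.isqrt(cache_words)


def parallel_time(n: int, b: int, procs: int) -> int:
    """Hypothetical parallel-level cost: the n x n space is cut into
    ceil(n/b)**2 tiles, the tiles are dealt evenly to `procs` processors,
    and every tile (even a partial edge tile) is charged as a full b*b tile.
    Returns the iteration count of the busiest processor."""
    tiles = ceil_div(n, b) ** 2
    tiles_per_proc = ceil_div(tiles, procs)
    return tiles_per_proc * b * b


def multi_level_tile(n: int, procs: int, cache_words: int) -> int:
    """Two-level choice: among tiles that fit in cache, pick the one that
    minimizes the busiest processor's work; ties keep the larger tile."""
    best_b, best_t = None, None
    for b in range(cache_only_tile(cache_words), 0, -1):  # largest tiles first
        t = parallel_time(n, b, procs)
        if best_t is None or t < best_t:  # strict '<' so ties favor larger b
            best_b, best_t = b, t
    return best_b


if __name__ == "__main__":
    n, procs, cache_words = 1000, 8, 4096
    b1 = cache_only_tile(cache_words)
    b2 = multi_level_tile(n, procs, cache_words)
    print(b1, parallel_time(n, b1, procs))  # 64 131072
    print(b2, parallel_time(n, b2, procs))  # 50 125000
```

In this toy setting, the cache-only tile (64) leaves uneven edge tiles and an unbalanced tile count. The slightly smaller tile (50) divides the space exactly and gives each processor 50 tiles, which reaches the ideal n²/procs = 125000 iterations. The improvement comes only from considering both levels together. A realistic model would also charge for cache misses, communication, and idle time.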
This work was supported in part by NSF grant CCR-9504150 and by a UC MICRO grant in association with the Intel Corporation.
Copyright information
© 1998 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Mitchell, N., Carter, L., Ferrante, J., Högstedt, K. (1998). Quantifying the multi-level nature of tiling interactions. In: Li, Z., Yew, PC., Chatterjee, S., Huang, CH., Sadayappan, P., Sehr, D. (eds) Languages and Compilers for Parallel Computing. LCPC 1997. Lecture Notes in Computer Science, vol 1366. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0032680
DOI: https://doi.org/10.1007/BFb0032680
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-64472-9
Online ISBN: 978-3-540-69788-6
eBook Packages: Springer Book Archive