Abstract
Over the past 20 years, increases in processor speed have dramatically outstripped performance increases for standard memory chips. To bridge this gap, compilers must optimize applications so that data fetched into caches are reused before being displaced. Existing compiler techniques can efficiently optimize simple loop structures such as sequences of perfectly nested loops. On more complicated structures, however, existing techniques are either ineffective or require too much computation time to be practical for a commercial compiler. To optimize complex loop structures both effectively and inexpensively, we present a novel loop transformation, dependence hoisting, for optimizing arbitrarily nested loops, together with an efficient framework that applies the new technique to aggressively optimize benchmarks for better locality. Our technique is as inexpensive as the traditional unimodular loop transformation techniques and thus can be incorporated into commercial compilers. It is also highly effective: it is able to block several linear algebra kernels containing highly challenging loop structures, in particular Cholesky, QR, LU factorization without pivoting, and LU with partial pivoting. The automatic blocking of QR and pivoting LU is a notable achievement. To our knowledge, few previous compiler techniques, including theoretically more general loop transformation frameworks [1, 21, 23, 27, 31], were able to completely automate the blocking of these kernels, and none has produced the same blocking as produced by our technique. These results indicate that, at low compilation cost, our technique can in practice match the effectiveness of much more expensive frameworks that are theoretically more powerful.
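To make "complex loop structure" concrete, the following is a minimal sketch of unblocked LU factorization without pivoting, one of the kernels named above. It is an illustration only: the function name `lu_nopivot`, the row-major layout, and the loop order are assumptions made for this example, and the code shows the kind of imperfectly nested input such a transformation must handle, not the blocked code produced by dependence hoisting.

```c
#include <stddef.h>

/* Hypothetical illustration: in-place LU factorization without pivoting
 * on an n x n row-major matrix. On return, the strict lower triangle of
 * `a` holds the multipliers of L (unit diagonal implied) and the upper
 * triangle holds U. Assumes every pivot a[k][k] is nonzero.
 *
 * The loop nest is imperfect: the scaling loop and the update nest are
 * siblings inside loop k, so the statements sit at different depths. */
static void lu_nopivot(int n, double *a) {
    for (int k = 0; k < n - 1; k++) {
        /* Depth 2: scale column k below the diagonal. */
        for (int i = k + 1; i < n; i++) {
            a[(size_t)i * n + k] /= a[(size_t)k * n + k];
        }
        /* Depth 3: update the trailing submatrix. */
        for (int j = k + 1; j < n; j++) {
            for (int i = k + 1; i < n; i++) {
                a[(size_t)i * n + j] -=
                    a[(size_t)i * n + k] * a[(size_t)k * n + j];
            }
        }
    }
}
```

Because the scaling statement and the update statement do not share a single innermost loop, transformations defined only for perfectly nested loops cannot be applied to this kernel directly, which is the situation the abstract describes.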
References
N. Ahmed, N. Mateev, and K. Pingali. Synthesizing transformations for locality enhancement of imperfectly nested loop nests. In Proceedings of the 2000 ACM International Conference on Supercomputing, Santa Fe, New Mexico, May 2000.
J. R. Allen and K. Kennedy. Automatic loop interchange. In Proceedings of the SIGPLAN '84 Symposium on Compiler Construction, Montreal, June 1984.
R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures, Morgan Kaufmann, San Francisco, October 2001.
E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users' Guide, The Society for Industrial and Applied Mathematics, 1999.
U. Banerjee. Dependence Analysis for Supercomputing, Kluwer Academic Publishers, Boston, 1988.
S. Carr and K. Kennedy. Compiler blockability of numerical algorithms. In Proceedings of Supercomputing, Minneapolis, Nov. 1992.
S. Carr and K. Kennedy. Improving the ratio of memory operations to floating-point operations in loops. ACM Transactions on Programming Languages and Systems, 16(6):1768–1810, 1994.
S. Carr and R. Lehoucq. Compiler blockability of dense matrix factorizations. ACM Trans. Math. Softw., 23(3), 1997.
L. Carter, J. Ferrante, and S. F. Hummel. Hierarchical tiling for improved superscalar performance. In Proc. 9th International Parallel Processing Symposium, Santa Barbara, CA, Apr. 1995.
S. Coleman and K. S. McKinley. Tile size selection using cache organization. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, La Jolla, CA, June 1995.
C. Ding. Improving Effective Bandwidth through Compiler Enhancement of Global and Dynamic Cache Reuse, PhD thesis, Rice University, 2000.
J. Dongarra, J. Bunch, C. Moler, and G. Stewart. LINPACK Users' Guide, Society for Industrial and Applied Mathematics, 1979.
J. J. Dongarra, F. G. Gustavson, and A. Karp. Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. SIAM Review, 26(1):91–112, Jan. 1984.
D. Gannon, W. Jalby, and K. Gallivan. Strategies for cache and local memory management by global program transformation. Journal of Parallel and Distributed Computing, 5(5):587–616, Oct. 1988.
G. H. Golub and C. F. Van Loan. Matrix Computations, 2nd ed. The Johns Hopkins University Press, 1989.
W. Kelly, V. Maslov, W. Pugh, E. Rosser, T. Shpeisman, and D. Wonnacott. The Omega Library Interface Guide, Technical report, Dept. of Computer Science, Univ. of Maryland, College Park, Apr. 1996.
W. Kelly, W. Pugh, E. Rosser, and T. Shpeisman. Transitive closure of infinite graphs and its applications. International Journal of Parallel Programming, 24(6), Dec. 1996.
K. Kennedy. Fast greedy weighted fusion. In Proceedings of the International Conference on Supercomputing, Santa Fe, NM, May 2000.
K. Kennedy and K. McKinley. Optimizing for parallelism and data locality. In Proceedings of the ACM International Conference on Supercomputing, July 1992.
K. Kennedy and K. S. McKinley. Typed fusion with applications to parallel and sequential code generation. Technical Report TR93–208, Dept. of Computer Science, Rice University, Aug. 1993. (also available as CRPC-TR94370).
I. Kodukula, N. Ahmed, and K. Pingali. Data-centric multi-level blocking. In Proceedings of the SIGPLAN '97 Conference on Programming Language Design and Implementation, Las Vegas, NV, June 1997.
M. Lam, E. Rothberg, and M. E. Wolf. The cache performance and optimizations of blocked algorithms. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IV), Santa Clara, Apr. 1991.
A. W. Lim, G. I. Cheong, and M. S. Lam. An affine partitioning algorithm to maximize parallelism and minimize communication. In Proceedings of the 13th ACM SIGARCH International Conference on Supercomputing, Rhodes, Greece, June 1999.
N. Manjikian and T. Abdelrahman. Fusion of loops for parallelism and locality. IEEE Transactions on Parallel and Distributed Systems, 8(2):193–209, Feb. 1997.
K. S. McKinley, S. Carr, and C.-W. Tseng. Improving data locality with loop transformations. ACM Transactions on Programming Languages and Systems, 18(4):424–453, July 1996.
N. Mitchell, L. Carter, J. Ferrante, and K. Högstedt. Quantifying the multi-level nature of tiling interactions. In 10th International Workshop on Languages and Compilers for Parallel Computing, August 1997.
W. Pugh. Uniform techniques for loop optimization. In Proceedings of the 1991 ACM International Conference on Supercomputing, Cologne, June 1991.
G. Rivera and C.-W. Tseng. Data transformations for eliminating conflict misses. In ACM SIGPLAN Conference on Programming Language Design and Implementation, Montreal, Canada, June 1998.
E. J. Rosser. Fine Grained Analysis Of Array Computations. PhD thesis, Dept. of Computer Science, University of Maryland, Sep. 1998.
R. Whaley and J. Dongarra. Automatically tuned linear algebra software (ATLAS). In Proceedings of Supercomputing '89, 1989.
W. Pugh and E. Rosser. Iteration space slicing for locality. In LCPC 99, July 1999.
M. E. Wolf and M. Lam. A data locality optimizing algorithm. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, Toronto, June 1991.
M. J. Wolfe. Advanced loop interchanging. In Proceedings of the 1986 International Conference on Parallel Processing, St. Charles, IL, Aug. 1986.
M. J. Wolfe. More iteration space tiling. In Proceedings of Supercomputing '89, Reno, Nov. 1989.
M. J. Wolfe. Optimizing Supercompilers for Supercomputers, The MIT Press, Cambridge, 1989.
Q. Yi, V. Adve, and K. Kennedy. Transforming loops to recursion for multi-level memory hierarchies. In ACM SIGPLAN Conference on Programming Language Design and Implementation, Vancouver, British Columbia, Canada, June 2000.
Cite this article
Yi, Q., Kennedy, K. & Adve, V. Transforming Complex Loop Nests for Locality. The Journal of Supercomputing 27, 219–264 (2004). https://doi.org/10.1023/B:SUPE.0000011386.69245.f5