Transforming Complex Loop Nests for Locality

Yi, Qing; Kennedy, Ken; Adve, Vikram

doi:10.1023/B:SUPE.0000011386.69245.f5

Published: March 2004

Volume 27, pages 219–264, (2004)

The Journal of Supercomputing

Qing Yi¹,
Ken Kennedy¹ &
Vikram Adve²

Abstract

Over the past 20 years, increases in processor speed have dramatically outstripped performance increases for standard memory chips. To bridge this gap, compilers must optimize applications so that data fetched into caches are reused before being displaced. Existing compiler techniques can efficiently optimize simple loop structures such as sequences of perfectly nested loops. However, on more complicated structures, existing techniques are either ineffective or require too much computation time to be practical for a commercial compiler. To optimize complex loop structures both effectively and inexpensively, we present a novel loop transformation, dependence hoisting, for optimizing arbitrarily nested loops, and an efficient framework that applies the new technique to aggressively optimize benchmarks for better locality. Our technique is as inexpensive as the traditional unimodular loop transformation techniques and thus can be incorporated into commercial compilers. In addition, it is highly effective and is able to block several linear algebra kernels containing highly challenging loop structures, in particular, Cholesky, QR, LU factorization without pivoting, and LU with partial pivoting. The automatic blocking of QR and pivoting LU is a notable achievement—to our knowledge, few previous compiler techniques, including theoretically more general loop transformation frameworks [1, 21, 23, 27, 31], were able to completely automate the blocking of these kernels, and none has produced the same blocking as produced by our technique. These results indicate that with low compilation cost, our technique can in practice match the effectiveness of much more expensive frameworks that are theoretically more powerful.
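
To make the term "blocking" concrete, the following is a minimal Python sketch of the simple case that the abstract says existing techniques already handle: a perfectly nested matrix-multiply loop and its blocked (tiled) form. It is not dependence hoisting, and it does not cover the imperfectly nested kernels (Cholesky, QR, LU) that the article targets. The function names and the block-size parameter `bs` are illustrative assumptions, and pure Python shows only the reordering of iterations, not the cache benefit.

```python
def matmul_naive(a, b, n):
    """Reference i-j-k matrix multiply on n x n lists of lists."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, n, bs):
    """Same computation, with each loop strip-mined into blocks of size bs.

    The three outer loops walk over bs x bs tiles; the three inner loops
    work inside one tile, so a tile's data is reused before moving on.
    min() handles the case where bs does not divide n evenly.
    """
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for jj in range(0, n, bs):
            for kk in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    for j in range(jj, min(jj + bs, n)):
                        for k in range(kk, min(kk + bs, n)):
                            c[i][j] += a[i][k] * b[k][j]
    return c
```

For every pair `(i, j)` the blocked version still visits `k` in increasing order, so both functions accumulate each entry of `c` in the same order and return identical results. Only the interleaving of work across different entries changes.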

References

1. N. Ahmed, N. Mateev, and K. Pingali. Synthesizing transformations for locality enhancement of imperfectly nested loop nests. In Proceedings of the 2000 ACM International Conference on Supercomputing, Santa Fe, New Mexico, May 2000.
2. J. R. Allen and K. Kennedy. Automatic loop interchange. In Proceedings of the SIGPLAN '84 Symposium on Compiler Construction, Montreal, June 1984.
3. R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures, Morgan Kaufmann, San Francisco, October 2001.
4. E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users' Guide, The Society for Industrial and Applied Mathematics, 1999.
5. U. Banerjee. Dependence Analysis for Supercomputing, Kluwer Academic Publishers, Boston, 1988.
6. S. Carr and K. Kennedy. Compiler blockability of numerical algorithms. In Proceedings of Supercomputing, Minneapolis, Nov. 1992.
7. S. Carr and K. Kennedy. Improving the ratio of memory operations to floating-point operations in loops. ACM Transactions on Programming Languages and Systems, 16(6):1768–1810, 1994.
8. S. Carr and R. Lehoucq. Compiler blockability of dense matrix factorizations. ACM Transactions on Mathematical Software, 23(3), 1997.
9. L. Carter, J. Ferrante, and S. F. Hummel. Hierarchical tiling for improved superscalar performance. In Proceedings of the 9th International Parallel Processing Symposium, Santa Barbara, CA, Apr. 1995.
10. S. Coleman and K. S. McKinley. Tile size selection using cache organization. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, La Jolla, CA, June 1995.
11. C. Ding. Improving Effective Bandwidth through Compiler Enhancement of Global and Dynamic Cache Reuse, PhD thesis, Rice University, 2000.
12. J. Dongarra, J. Bunch, C. Moler, and G. Stewart. LINPACK Users' Guide, Society for Industrial and Applied Mathematics, 1979.
13. J. J. Dongarra, F. G. Gustavson, and A. Karp. Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. SIAM Review, 26(1):91–112, Jan. 1984.
14. D. Gannon, W. Jalby, and K. Gallivan. Strategies for cache and local memory management by global program transformation. Journal of Parallel and Distributed Computing, 5(5):587–616, Oct. 1988.
15. G. H. Golub and C. F. Van Loan. Matrix Computations, 2nd ed. The Johns Hopkins University Press, 1989.
16. W. Kelly, V. Maslov, W. Pugh, E. Rosser, T. Shpeisman, and D. Wonnacott. The Omega Library Interface Guide, Technical report, Dept. of Computer Science, Univ. of Maryland, College Park, Apr. 1996.
17. W. Kelly, W. Pugh, E. Rosser, and T. Shpeisman. Transitive closure of infinite graphs and its applications. International Journal of Parallel Programming, 24(6), Dec. 1996.
18. K. Kennedy. Fast greedy weighted fusion. In Proceedings of the International Conference on Supercomputing, Santa Fe, NM, May 2000.
19. K. Kennedy and K. McKinley. Optimizing for parallelism and data locality. In Proceedings of the ACM International Conference on Supercomputing, July 1992.
20. K. Kennedy and K. S. McKinley. Typed fusion with applications to parallel and sequential code generation. Technical Report TR93-208, Dept. of Computer Science, Rice University, Aug. 1993. (Also available as CRPC-TR94370.)
21. I. Kodukula, N. Ahmed, and K. Pingali. Data-centric multi-level blocking. In Proceedings of the SIGPLAN '97 Conference on Programming Language Design and Implementation, Las Vegas, NV, June 1997.
22. M. Lam, E. Rothberg, and M. E. Wolf. The cache performance and optimizations of blocked algorithms. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IV), Santa Clara, Apr. 1991.
23. A. W. Lim, G. I. Cheong, and M. S. Lam. An affine partitioning algorithm to maximize parallelism and minimize communication. In Proceedings of the 13th ACM SIGARCH International Conference on Supercomputing, Rhodes, Greece, June 1999.
24. N. Manjikian and T. Abdelrahman. Fusion of loops for parallelism and locality. IEEE Transactions on Parallel and Distributed Systems, 8(2):193–209, Feb. 1997.
25. K. S. McKinley, S. Carr, and C.-W. Tseng. Improving data locality with loop transformations. ACM Transactions on Programming Languages and Systems, 18(4):424–453, July 1996.
26. N. Mitchell, L. Carter, J. Ferrante, and K. Högstedt. Quantifying the multi-level nature of tiling interactions. In 10th International Workshop on Languages and Compilers for Parallel Computing, August 1997.
27. W. Pugh. Uniform techniques for loop optimization. In Proceedings of the 1991 ACM International Conference on Supercomputing, Cologne, June 1991.
28. G. Rivera and C.-W. Tseng. Data transformations for eliminating conflict misses. In ACM SIGPLAN Conference on Programming Language Design and Implementation, Montreal, Canada, June 1998.
29. E. J. Rosser. Fine Grained Analysis Of Array Computations. PhD thesis, Dept. of Computer Science, University of Maryland, Sep. 1998.
30. R. Whaley and J. Dongarra. Automatically tuned linear algebra software (ATLAS). In Proceedings of Supercomputing '98, 1998.
31. W. Pugh and E. Rosser. Iteration space slicing for locality. In LCPC 99, July 1999.
32. M. E. Wolf and M. Lam. A data locality optimizing algorithm. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, Toronto, June 1991.
33. M. J. Wolfe. Advanced loop interchanging. In Proceedings of the 1986 International Conference on Parallel Processing, St. Charles, IL, Aug. 1986.
34. M. J. Wolfe. More iteration space tiling. In Proceedings of Supercomputing '89, Reno, Nov. 1989.
35. M. J. Wolfe. Optimizing Supercompilers for Supercomputers, The MIT Press, Cambridge, 1989.
36. Q. Yi, V. Adve, and K. Kennedy. Transforming loops to recursion for multi-level memory hierarchies. In ACM SIGPLAN Conference on Programming Language Design and Implementation, Vancouver, British Columbia, Canada, June 2000.

Author information

Authors and Affiliations

Rice University, 6100 Main Street MS-132, Houston, TX, 77005
Qing Yi & Ken Kennedy
University of Illinois at Urbana-Champaign, 1304 W. Springfield Ave, Urbana, IL, 61801
Vikram Adve

About this article

Cite this article

Yi, Q., Kennedy, K. & Adve, V. Transforming Complex Loop Nests for Locality. The Journal of Supercomputing 27, 219–264 (2004). https://doi.org/10.1023/B:SUPE.0000011386.69245.f5

Issue Date: March 2004
DOI: https://doi.org/10.1023/B:SUPE.0000011386.69245.f5
