Data-Centric Transformations for Locality Enhancement

International Journal of Parallel Programming

Abstract

On modern computers, the performance of programs is often limited by memory latency rather than by processor cycle time. To reduce the impact of memory latency, the restructuring compiler community has developed locality-enhancing program transformations such as loop permutation and tiling. These transformations work well for perfectly nested loops (loops in which all assignment statements are contained in the innermost loop), but their performance on codes such as matrix factorizations that contain imperfectly nested loops leaves much to be desired. In this paper, we propose an alternative approach called data-centric transformation. Instead of reasoning directly about the control structure of the program, a compiler using the data-centric approach chooses an order for the arrival of data elements in the cache, determines what computations should be performed when that data arrives, and generates the appropriate code. At runtime, program execution will automatically pull data into the cache in an order that corresponds approximately to the order chosen by the compiler; since statements that touch a data structure element are scheduled close together, locality is improved. The idea of data-centric transformation is very general, and in this paper, we discuss a particular transformation called data-shackling. We have implemented shackling in the SGI MIPSPro compiler, which already has a sophisticated implementation of control-centric transformations for locality enhancement. We present experimental results on the SGI Octane comparing the performance of the two approaches, and show that for dense numerical linear algebra codes, data-shackling does better by factors of two to five.
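
The data-centric ordering described in the abstract can be illustrated with a small hand-written sketch. It is not the paper's algorithm or the output of the MIPSPro implementation; it only assumes a dense matrix multiply whose result array `C` is cut into square blocks, and it runs every statement instance that writes into a block while that block is the current one. The function names and the `block` parameter are hypothetical, chosen for this illustration.

```python
def matmul_original(A, B, n):
    """Control-centric reference: a perfectly nested i-j-k loop."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C


def matmul_data_centric(A, B, n, block=2):
    """Data-centric sketch: walk C block by block (assumed block order:
    row-major over blocks) and, for each block, execute all statement
    instances whose write C[i][j] lands inside that block."""
    C = [[0.0] * n for _ in range(n)]
    for bi in range(0, n, block):          # block row of C
        for bj in range(0, n, block):      # block column of C
            # Every (i, j, k) instance that touches the current block of C.
            for i in range(bi, min(bi + block, n)):
                for j in range(bj, min(bj + block, n)):
                    for k in range(n):
                        C[i][j] += A[i][k] * B[k][j]
    return C
```

Both functions perform the same additions into each `C[i][j]` in the same `k` order, so they return identical results; only the order in which elements of `C` are visited differs. Blocking on `C` alone leaves the accesses to `A` and `B` unblocked, so the sketch shows the scheduling idea rather than a tuned kernel.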


Author information

Corresponding author

Correspondence to Keshav Pingali.

Cite this article

Kodukula, I., Pingali, K. Data-Centric Transformations for Locality Enhancement. International Journal of Parallel Programming 29, 319–364 (2001). https://doi.org/10.1023/A:1011172104768

  • DOI: https://doi.org/10.1023/A:1011172104768
