Abstract
The performance of irregular applications on modern computer systems is hurt by the wide gap between CPU and memory speeds because these applications typically under-utilize multi-level memory hierarchies, which help hide this gap. This paper investigates using data and computation reorderings to improve memory hierarchy utilization for irregular applications. We evaluate the impact of reordering on data reuse at different levels in the memory hierarchy. We focus on coordinated data and computation reordering based on space-filling curves and we introduce a new architecture-independent multi-level blocking strategy for irregular applications. For two particle codes we studied, the most effective reorderings reduced overall execution time by a factor of two and four, respectively. Preliminary experience with a scatter benchmark derived from a large unstructured mesh application showed that careful data and computation ordering reduced primary cache misses by a factor of two compared to a random ordering.
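As a rough illustration of coordinated data and computation reordering along a space-filling curve, the sketch below sorts 2D particles by a Morton (Z-order) key and then renumbers and sorts the interaction pairs to match. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say which curve was used, so Z-order is chosen here only because it is short to write, and the names `morton_key_2d` and `reorder_by_morton`, the unit-square coordinates, and the pair-list representation are all hypothetical.

```python
def morton_key_2d(ix: int, iy: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of ix and iy into a Z-order key."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (2 * b)        # x bits land on even positions
        key |= ((iy >> b) & 1) << (2 * b + 1)    # y bits land on odd positions
    return key


def reorder_by_morton(points, pairs, bits=16):
    """Reorder data, then computation, along a Z-order curve.

    points: list of (x, y) floats, assumed to lie in [0, 1).
    pairs:  list of (i, j) index pairs, one per interaction to compute.
    Returns (new_points, new_pairs, perm), where perm[new] == old index.
    """
    scale = 1 << bits

    def key(p):
        # Quantize each coordinate onto a 2^bits grid, clamping the top edge.
        ix = min(int(p[0] * scale), scale - 1)
        iy = min(int(p[1] * scale), scale - 1)
        return morton_key_2d(ix, iy, bits)

    # Data reordering: particles close in space become close in memory.
    perm = sorted(range(len(points)), key=lambda i: key(points[i]))
    new_points = [points[i] for i in perm]

    # Map each old index to its new position.
    new_index = [0] * len(points)
    for new, old in enumerate(perm):
        new_index[old] = new

    # Computation reordering: renumber the pairs, then sort them so that
    # consecutive iterations touch neighboring entries of new_points.
    new_pairs = sorted(
        (min(new_index[i], new_index[j]), max(new_index[i], new_index[j]))
        for i, j in pairs
    )
    return new_points, new_pairs, perm


if __name__ == "__main__":
    pts = [(0.9, 0.9), (0.1, 0.1), (0.9, 0.1), (0.1, 0.9)]
    prs = [(0, 1), (2, 3), (1, 2)]
    new_pts, new_prs, perm = reorder_by_morton(pts, prs)
    print(perm)
    print(new_prs)
```

Because a Z-order key is built from interleaved coordinate bits, any aligned run of keys corresponds to a square region at some scale, which is one reason curve-based orderings tend to help at several cache levels at once without tuning to a specific cache size.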
Cite this article
Mellor-Crummey, J., Whalley, D. & Kennedy, K. Improving Memory Hierarchy Performance for Irregular Applications Using Data and Computation Reorderings. International Journal of Parallel Programming 29, 217–247 (2001). https://doi.org/10.1023/A:1011119519789
DOI: https://doi.org/10.1023/A:1011119519789