Improving Memory Hierarchy Performance for Irregular Applications Using Data and Computation Reorderings

Mellor-Crummey, John; Whalley, David; Kennedy, Ken

doi:10.1023/A:1011119519789

Improving Memory Hierarchy Performance for Irregular Applications Using Data and Computation Reorderings

Published: June 2001

Volume 29, pages 217–247, (2001)
Cite this article

International Journal of Parallel Programming Aims and scope Submit manuscript

John Mellor-Crummey¹,
David Whalley² &
Ken Kennedy¹

430 Accesses
102 Citations
Explore all metrics

Abstract

The performance of irregular applications on modern computer systems is hurt by the wide gap between CPU and memory speeds because these applications typically under-utilize multi-level memory hierarchies, which help hide this gap. This paper investigates using data and computation reorderings to improve memory hierarchy utilization for irregular applications. We evaluate the impact of reordering on data reuse at different levels in the memory hierarchy. We focus on coordinated data and computation reordering based on space-filling curves and we introduce a new architecture-independent multi-level blocking strategy for irregular applications. For two particle codes we studied, the most effective reorderings reduced overall execution time by a factor of two and four, respectively. Preliminary experience with a scatter benchmark derived from a large unstructured mesh application showed that careful data and computation ordering reduced primary cache misses by a factor of two compared to a random ordering.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

RADAR: Runtime Asymmetric Data-Access Driven Scientific Data Replication

Supporting Adaptive Privatization Techniques for Irregular Array Reductions in Task-Parallel Programming Models

Balancing Tracking Granularity and Parallelism in Many-Task Systems: The Horizons Approach

Article Open access 06 April 2024

REFERENCES

D. Callahan, S. Carr, and K. Kennedy, Improving Register Allocation for Subscripted Variables, Proc. ACM SIGPLAN Conf. Progr. Lang. Design Implementation, pp. 53-65 (June 1990).
D. Gannon, W. Jalby, and K. Gallivan, Strategies for Cache and Local Memory Management by Global Program Transformation, J. Parallel Distributed Computing, 5:587-616 (1988).
Google Scholar
M. S. Lam, E. E. Rothberg, and M. E. Wolf, The Cache Performance and Optimizations of Blocked Algorithms, Proc. Fourth Int'l. Conf. Architectural Support Progr. Lang. Oper. Syst., pp. 63-74 (April 1991).
A. K. Porterfield, Software Methods for Improvement of Cache Performance on Super-computer Applications, Ph.D. Dissertation, Rice University, Houston, Texas (May 1989).
Google Scholar
M. E. Wolf and M. S. Lam, A Data Locality Optimizing Algorithm, Proc. SIGPLAN Conf. Progr. Lang. Design and Implementation, pp. 30-44 (June 1991).
J. Ferrante, V. Sarkar, and W. Thrash, On Estimating and Enhancing Cache Effective-ness, Proc. Fourth Workshop on Lang. Compilers for Parallel Computing (August 1991).
D. M. Tullsen and S. J. Eggers, Effective Cache Prefetching on Bus-Based Multipro-cessors, ACM Trans. Computer Syst., 13(1):57-88 (February 1995).
Google Scholar
T. C. Mowry, M. S. Lam, and A. Gupta, Design and Evaluation of a Compiler Algorithm for Prefetching, Proc. Fifth Int'l. Conf. Architectural Support Progr. Lang. Oper. Syst., pp. 62-73 (October 1992).
A. C. McKeller and E. G. Coffman, The Organization of Matrices and Matrix Operations in a Paged Multiprogramming Environment, Commun. ACM, 12(3):153-165 (1969).
Google Scholar
W. Abu-Sufah, D. J. Kuck, and D. H. Lawrie, Automatic Program Transformations for Virtual Memory Computers, Proc. Nat'l. Computer Conf., pp. 969-974 (June 1979).
J. J. Navarro, E. Garcia, and J. R. Herrero, Proc. Tenth ACM Int'l. Conf. Supercomputing (ICS) (1996).
I. Kodukula, N. Ahmed, and K. Pingali, Data-Centric Multi-level Blocking, Proc. ACM SIGPLAN Conf. Progr. Lang. Design Implementation, pp. 346-357 (June 1997).
J. R. Allen and K. Kennedy, Automatic Loop Interchange, Proc. SIGPLAN Symp. Compiler Construction SIGPLAN Notices, 19(6):233-246 (June 1984).
Google Scholar
K. S. McKinley, S. Carr, and C.-W. Tseng, Improving Data Locality with Loop Transformations, ACM Trans. Progr. Lang. Syst., 18(4):424-453 (July 1996).
Google Scholar
C. Ding and K. Kennedy, Improving Cache Performance of Dynamic Applications with Computation and Data Layout Transformations, Proc. ACM SIGPLAN Conf. Progr. Lang. Design Implementation, pp. 229-241 (May 1999).
R. Das, D. Mavriplis, J. Saltz, S. Gupta, and R. Ponnusamy, The Design and Implemen-tation of a Parallel Unstructured Euler Solver Using Software Primitives, AIAA J., 32:489-496 (1994).
Google Scholar
H. Sagan, Space-Filling Curves, Springer-Verlag, New York (1994).
Google Scholar
H. Samet, Applications of Spatial Data Structures: Computer Graphics, Image Processing and GIS, Addison-Wesley, New York (1989).
Google Scholar
J. P Singh, C. Holt, T. Totsuka, A. Gupta, and J. Hennessy, Load Balancing and Data Locality in Adaptive Hierarhcical N-body Methods: Barnes-Hut, Fast Multipole, and Radiosity, J. Parallel Distributed Computing (June 1995).
M. S. Warren and J. K. Salmon, A Parallel Hashed Oct-Tree N-Body Algorithm, Proc. Supercomputing (November 1993).
C. Ou, M. Gunwani, and S. Ranka, Architecture-Independent Locality-Improving Transformations of Computational Graphs Embedded in k-Dimensions, Proc. Int'l. Conf. Supercomputing (1995).
M. Parashar and J. C. Browne, On Partitioning Dynamic Adaptive Grid Hierarchies, Proc. Hawaii Conf. Syst. Sci. (January 1996).
M. Thottethodi, S. Chatterjee, and A. R. Lebeck, Tuning Strassen's Matrix Multiplication Algorithm for Memory Efficiency, Proc. SC98: High Performance Computing and Networking (November 1998).
J. Frens and D. Wise, Auto-blocking Matrix Multiplication or Tracking BLAS3 Performance from Source Code, Proc. ACM SIGPLAN Conf. Progr. Lang. Design Implementation, pp. 206-216 (June 1997).
I. Al-Furaih and S. Ranka, Memory Hierarchy Management for Iterative Graph Structures, Proc. Int'l. Parallel Processing Symp. (March 1998).
A. George and G. Liu, Computer Solution of Large Sparse Positive Definite Systems, Prentice Hall, Englewood Cliffs, New Jersey (1981).
Google Scholar
E. Cuthill and J. McKee, Reducing the Bandwidth of Sparse Symmetric Matrices, Proc. ACM National Conf., Association of Computing Machinery (1969).
S. Sloan, An Algorithm for Profile and Wavefront Reduction of Sparse Matrices, Int'l. J. Numerical Methods Engng., 23:239-251 (1986).
Google Scholar
N. Mitchell, L. Carter, and J. Ferrante, Localizing Nonaffine Array References, Proc. Parallel Architectures and Compilation Techniques (October 1999).
J. Mellor-Crummey, D. Whalley, and K. Kennedy, Improving Memory Hierarchy Performance for Irregular Applications, Proc. ACM Int'l. Conf. Supercomputing, pp. 425-433 (June 1999).
H. Prokop, Cache-Oblivious Algorithms, Master's thesis, MIT Department of Electrical Engineering and Computer Science (June 1999).
D. Knuth, The Art of Computer Programming Volume 3: Sorting and Searching, Addison-Wesley, New York (1973).
Google Scholar
B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus, CHARMM: A Program for Macromolecular Energy, Minimization and Dynamics Calculations, J. Computational Chemistry, 4:187-217 (1983).
Google Scholar
G. Karypis and V. Kumar, Parallel Multilevel k-way Partition Scheme for Irregular Graphs, SIAM Review, 41: 278-300 (1999).
Google Scholar
R. Robey, Personal Communication (September 2000).
Y. C. Hu, A. Cox, and W. Zwaenepoel, Improving Fine-Grained Irregular Shared-Memory Benchmarks by Data Reordering, Proc. Supercomputing (November 2000).
V. Pai and S. Adve, Code Transformations to Improve Memory Parallelism, Proc. MICRO-32 (November 1999).

Download references

Author information

Authors and Affiliations

Department of Computer Science, MS 132, Rice University, 6100 Main, Houston, Texas, 77005
John Mellor-Crummey & Ken Kennedy
Computer Science Department, Florida State University, Tallahassee, Florida, 32306-4530
David Whalley

Authors

John Mellor-Crummey
View author publications
You can also search for this author in PubMed Google Scholar
David Whalley
View author publications
You can also search for this author in PubMed Google Scholar
Ken Kennedy
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to John Mellor-Crummey.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Mellor-Crummey, J., Whalley, D. & Kennedy, K. Improving Memory Hierarchy Performance for Irregular Applications Using Data and Computation Reorderings. International Journal of Parallel Programming 29, 217–247 (2001). https://doi.org/10.1023/A:1011119519789

Download citation

Issue Date: June 2001
DOI: https://doi.org/10.1023/A:1011119519789

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Improving Memory Hierarchy Performance for Irregular Applications Using Data and Computation Reorderings

Abstract

Access this article

Similar content being viewed by others

RADAR: Runtime Asymmetric Data-Access Driven Scientific Data Replication

Supporting Adaptive Privatization Techniques for Irregular Array Reductions in Task-Parallel Programming Models

Balancing Tracking Granularity and Parallelism in Many-Task Systems: The Horizons Approach

REFERENCES

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Navigation

Improving Memory Hierarchy Performance for Irregular Applications Using Data and Computation Reorderings

Abstract

Access this article

Similar content being viewed by others

RADAR: Runtime Asymmetric Data-Access Driven Scientific Data Replication

Supporting Adaptive Privatization Techniques for Irregular Array Reductions in Task-Parallel Programming Models

Balancing Tracking Granularity and Parallelism in Many-Task Systems: The Horizons Approach

REFERENCES

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Search

Navigation