Automatic benchmark generation for cache optimization of matrix operations

Authors:

John McCalpin,

Mark Smotherman

ACMSE '95: Proceedings of the 33rd annual ACM Southeast Conference

Pages 195 - 204

https://doi.org/10.1145/1122018.1122054

Published: 17 March 1995

Abstract

Computationally intensive algorithms must usually be restructured to make the best use of cache memory in current high-performance, hierarchical-memory computers. Unfortunately, cache-conscious algorithms are sensitive to object sizes and addresses as well as to the details of the cache and translation lookaside buffer geometries, and this sensitivity makes both automatic restructuring and hand-tuning difficult tasks. This paper presents an optimization approach that automatically generates and executes a benchmark program from a concise specification of the algorithm's structure. The technique provides the performance data needed to verify code generation heuristics or to search among the various restructuring options. Matrix transpose and matrix multiplication are examined using this approach on several workstations, with restructuring options of loop order, tiling (blocking), and unrolling.
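The generate-and-run idea in the abstract can be illustrated with a minimal sketch. This is not the authors' tool: the function names (`generate_transpose`, `build`, `run_benchmark`) and the form of the specification (a loop order plus a tile size) are assumptions made here for illustration, and unrolling is left out. Timings from interpreted Python also do not reflect cache behavior the way a compiled kernel would, so the sketch shows only the structure of the search, not realistic performance data.

```python
import itertools
import time


def generate_transpose(order, tile):
    """Return Python source for a tiled transpose, b[j][i] = a[i][j].

    order: ("i", "j") or ("j", "i"); nesting order of the tile loops and
           of the loops inside each tile (hypothetical spec format).
    tile:  tile edge length; tile >= n behaves like an untiled loop nest.
    """
    outer, inner = order
    lines = [
        "def kernel(a, b, n):",
        f"    for {outer}{outer} in range(0, n, {tile}):",
        f"        for {inner}{inner} in range(0, n, {tile}):",
        f"            for {outer} in range({outer}{outer}, min({outer}{outer} + {tile}, n)):",
        f"                for {inner} in range({inner}{inner}, min({inner}{inner} + {tile}, n)):",
        "                    b[j][i] = a[i][j]",
    ]
    return "\n".join(lines) + "\n"


def build(source):
    """Compile generated source and return its kernel function."""
    namespace = {}
    exec(source, namespace)
    return namespace["kernel"]


def run_benchmark(n, orders, tiles, repeats=3):
    """Time every (order, tile) variant; return results sorted fastest first."""
    a = [[float(i * n + j) for j in range(n)] for i in range(n)]
    expected = [list(column) for column in zip(*a)]
    results = []
    for order, tile in itertools.product(orders, tiles):
        kernel = build(generate_transpose(order, tile))
        best = float("inf")
        for _ in range(repeats):
            b = [[0.0] * n for _ in range(n)]
            start = time.perf_counter()
            kernel(a, b, n)
            best = min(best, time.perf_counter() - start)
            # Every generated variant must still compute a correct transpose.
            assert b == expected
        results.append((best, order, tile))
    return sorted(results)


if __name__ == "__main__":
    variants = run_benchmark(64, [("i", "j"), ("j", "i")], [8, 16, 64])
    for seconds, order, tile in variants:
        print(f"order={''.join(order)} tile={tile:3d} best={seconds:.6f}s")
```

Checking each variant against a reference result before recording its time keeps a faulty restructuring from being reported as the fastest option.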

Information

Published In

ACMSE '95: Proceedings of the 33rd annual ACM Southeast Conference

March 1995

300 pages

ISBN:0897917472

DOI:10.1145/1122018

Program Chair:
Robert Geist

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

ACMSE '95

March 17 - 18, 1995

Clemson, South Carolina

Acceptance Rates

ACMSE '95 Paper Acceptance Rate 47 of 75 submissions, 63%;

Overall Acceptance Rate 502 of 1,023 submissions, 49%
