ABSTRACT
Hardware prefetchers are effective at recognizing streaming memory access patterns and at moving data closer to the processing units to hide memory latency. However, finite hardware resources limit the number of data streams a prefetcher can track. In this paper, we introduce the term streaming concurrency to characterize the number of parallel, logical data streams in an application. We present a simulation algorithm for computing the streaming concurrency at any point in an application's execution, and we show that this metric is a good predictor of the number of memory requests initiated by streaming prefetchers. We then investigate the causes of poor prefetching performance: we identify four prefetch-unfriendly conditions and show how to classify an application's memory references according to these conditions. We evaluate our analysis on the SPEC CPU2006 benchmark suite, select two benchmarks with unfavorable access patterns, and transform them to improve their prefetching effectiveness. The results show that making applications more prefetcher-friendly can yield meaningful performance gains.
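The idea of streaming concurrency can be illustrated with a toy sketch. The following is not the paper's simulation algorithm, only a hedged approximation under two stated assumptions: 64-byte cache lines, and a stream is a run of accesses to consecutive cache lines, retired after it stays idle for a fixed window of references.

```python
CACHE_LINE = 64   # assumed cache-line size in bytes
WINDOW = 16       # assumed idle window (in references) before a stream is retired

def streaming_concurrency(trace):
    """Return the peak number of simultaneously active streams in an
    address trace. A stream is 'active' once it spans >= 2 consecutive
    cache lines; a new reference extends a stream if it touches the line
    immediately after the stream's most recent line."""
    streams = {}  # last cache line touched -> (stream length, last access time)
    peak = 0
    for t, addr in enumerate(trace):
        line = addr // CACHE_LINE
        if line - 1 in streams:
            # reference continues an existing ascending stream
            length, _ = streams.pop(line - 1)
            streams[line] = (length + 1, t)
        elif line not in streams:
            # reference may begin a new stream
            streams[line] = (1, t)
        # retire streams that have been idle longer than WINDOW references
        streams = {l: v for l, v in streams.items() if t - v[1] <= WINDOW}
        active = sum(1 for length, _ in streams.values() if length >= 2)
        peak = max(peak, active)
    return peak
```

For example, a trace that interleaves two sequential walks over distant regions exhibits a streaming concurrency of 2, which matches the intuition that a streaming prefetcher would need two stream-tracking entries to cover it.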
Index Terms
- Diagnosis and optimization of application prefetching performance