ABSTRACT
Hardware prefetchers are effective at recognizing streaming memory access patterns and at moving data closer to the processing units to hide memory latency. However, finite hardware resources limit the number of data streams a prefetcher can track. In this paper, we introduce the term streaming concurrency to characterize the number of parallel, logical data streams in an application. We present a simulation algorithm for computing the streaming concurrency at any point in an application's execution, and we show that this metric is a good predictor of the number of memory requests initiated by streaming prefetchers. We then investigate the causes of poor prefetching performance: we identify four prefetch-unfriendly conditions and show how to classify an application's memory references according to these conditions. We evaluate our analysis on the SPEC CPU2006 benchmark suite, select two benchmarks with unfavorable access patterns, and transform them to improve their prefetching effectiveness. The results show that making applications more prefetcher-friendly can yield meaningful performance gains.
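The idea of streaming concurrency can be illustrated with a toy sketch. The following is not the paper's simulation algorithm, only a hedged approximation under two stated assumptions: 64-byte cache lines, and a stream is a run of accesses to consecutive cache lines, retired after it stays idle for a fixed window of references.

```python
CACHE_LINE = 64   # assumed cache-line size in bytes
WINDOW = 16       # assumed idle window (in references) before a stream is retired

def streaming_concurrency(trace):
    """Return the peak number of simultaneously active streams in an
    address trace. A stream is 'active' once it spans >= 2 consecutive
    cache lines; a new reference extends a stream if it touches the line
    immediately after the stream's most recent line."""
    streams = {}  # last cache line touched -> (stream length, last access time)
    peak = 0
    for t, addr in enumerate(trace):
        line = addr // CACHE_LINE
        if line - 1 in streams:
            # reference continues an existing ascending stream
            length, _ = streams.pop(line - 1)
            streams[line] = (length + 1, t)
        elif line not in streams:
            # reference may begin a new stream
            streams[line] = (1, t)
        # retire streams that have been idle longer than WINDOW references
        streams = {l: v for l, v in streams.items() if t - v[1] <= WINDOW}
        active = sum(1 for length, _ in streams.values() if length >= 2)
        peak = max(peak, active)
    return peak
```

For example, a trace that interleaves two sequential walks over distant regions exhibits a streaming concurrency of 2, which matches the intuition that a streaming prefetcher would need two stream-tracking entries to cover it.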
Index Terms
- Diagnosis and optimization of application prefetching performance