DOI: 10.1145/2464996.2465014
research-article

Diagnosis and optimization of application prefetching performance

Published: 10 June 2013

ABSTRACT

Hardware prefetchers are effective at recognizing streaming memory access patterns and at moving data closer to the processing units to hide memory latency. However, because hardware resources are finite, they can track only a limited number of data streams. In this paper, we introduce the term streaming concurrency to characterize the number of parallel, logical data streams in an application. We present a simulation algorithm that computes the streaming concurrency at any point in an application, and we show that this metric is a good predictor of the number of memory requests initiated by streaming prefetchers. Next, we investigate the causes of poor prefetching performance. We identify four prefetch-unfriendly conditions and show how to classify an application's memory references according to these conditions. We evaluate our analysis using the SPEC CPU2006 benchmark suite. We select two benchmarks with unfavorable access patterns and transform them to improve their prefetching effectiveness. Results show that making applications more prefetcher-friendly can yield meaningful performance gains.
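
To make the notion of streaming concurrency concrete, the following is a minimal sketch of how the number of simultaneously live data streams might be counted over an address trace. It is not the simulation algorithm from the paper, whose details the abstract does not give. The function name `streaming_concurrency`, the `window` and `line_size` parameters, and the rule that a stream is a run of accesses to consecutive, ascending cache lines are all assumptions made for illustration.

```python
def streaming_concurrency(addresses, window=16, line_size=64):
    """Count live streams after each access in a trace of byte addresses.

    Assumptions (illustrative only, not taken from the paper):
      * a stream is a run of accesses to consecutive, ascending cache lines;
      * a stream stays live while it is touched at least once every
        `window` accesses;
      * a candidate counts as a stream once it has covered two lines.
    """
    # Maps the last cache line touched by a candidate stream to
    # (time of last touch, number of lines covered so far).
    streams = {}
    profile = []
    for t, addr in enumerate(addresses):
        line = addr // line_size
        # Expire candidates that have not been touched within the window.
        streams = {l: s for l, s in streams.items() if t - s[0] <= window}
        if line in streams:
            # Another access to the current line of an existing stream.
            streams[line] = (t, streams[line][1])
        elif line - 1 in streams:
            # The access advances an existing stream by one line.
            _, length = streams.pop(line - 1)
            streams[line] = (t, length + 1)
        else:
            # No neighbor found: start a new candidate stream.
            streams[line] = (t, 1)
        profile.append(sum(1 for _, n in streams.values() if n >= 2))
    return profile


if __name__ == "__main__":
    # Three arrays of 8-byte elements traversed together, as in
    # `for i: use a[i], b[i], c[i]`; the base addresses are arbitrary.
    bases = (0, 1 << 20, 2 << 20)
    trace = [base + 8 * i for i in range(64) for base in bases]
    print("peak streaming concurrency:", max(streaming_concurrency(trace)))
```

Under these assumptions, a loop that walks three arrays in lockstep has a peak streaming concurrency of three. Comparing such a peak against the number of streams a given prefetcher can track is one way to spot regions where some streams are likely to go unprefetched.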


Published in

ICS '13: Proceedings of the 27th international ACM conference on International conference on supercomputing
June 2013
512 pages
ISBN: 9781450321303
DOI: 10.1145/2464996

Copyright © 2013 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

ICS '13 Paper Acceptance Rate: 43 of 202 submissions, 21%
Overall Acceptance Rate: 584 of 2,055 submissions, 28%
