DOI: 10.1145/1993744.1993749
research-article

Studying the impact of hardware prefetching and bandwidth partitioning in chip-multiprocessors

Published: 07 June 2011

Abstract

Modern high-performance microprocessors widely employ hardware prefetching to hide long memory access latency. While very useful, hardware prefetching tends to aggravate the bandwidth wall, a problem in which system performance is increasingly limited by the available off-chip pin bandwidth in Chip Multi-Processors (CMPs).
In this paper, we present an analytical model-based study of how hardware prefetching and memory bandwidth partitioning impact CMP system performance and how they interact. The model includes a composite prefetching metric that helps determine the conditions under which prefetching improves system performance, a bandwidth partitioning model that takes prefetching effects into account, and a derivation of the bandwidth partition sizes for different cores that optimize weighted speedup. Through model-driven case studies, we make several observations that can be valuable for future CMP system design and optimization. We also use simulation-based empirical evaluation to validate the observations and show that maximum system performance can be achieved by selective prefetching, guided by the composite prefetching metric, coupled with dynamic bandwidth partitioning.
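
To make the weighted-speedup objective named in the abstract concrete, the following is a minimal sketch. It uses the commonly cited definition of weighted speedup: the sum, over cores, of each core's IPC when co-running divided by its IPC when running alone. Everything else is an assumption made for illustration and is not the paper's analytical model: the saturating IPC curve in `toy_ipc`, the helper `best_two_core_split`, and the brute-force grid search over a two-core bandwidth split.

```python
from typing import Callable, Sequence, Tuple


def weighted_speedup(ipc_shared: Sequence[float], ipc_alone: Sequence[float]) -> float:
    """Sum over cores of (IPC when co-running) / (IPC when running alone)."""
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("need one shared and one alone IPC value per core")
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))


def toy_ipc(peak_ipc: float, demand: float) -> Callable[[float], float]:
    """Hypothetical IPC curve (an assumption, not taken from the paper):
    IPC grows linearly with the core's bandwidth share until the share
    reaches `demand`, then saturates at `peak_ipc`."""
    def ipc(share: float) -> float:
        return peak_ipc * min(1.0, share / demand)
    return ipc


def best_two_core_split(
    ipc0: Callable[[float], float],
    ipc1: Callable[[float], float],
    alone0: float,
    alone1: float,
    steps: int = 100,
) -> Tuple[float, float]:
    """Grid-search the fraction of total bandwidth given to core 0.

    Returns (best fraction for core 0, weighted speedup at that fraction).
    """
    best_fraction, best_ws = 0.0, float("-inf")
    for k in range(steps + 1):
        fraction = k / steps
        ws = weighted_speedup(
            [ipc0(fraction), ipc1(1.0 - fraction)], [alone0, alone1]
        )
        if ws > best_ws:
            best_fraction, best_ws = fraction, ws
    return best_fraction, best_ws


if __name__ == "__main__":
    # Core 0 is bandwidth-hungry (needs 80% of the bandwidth to reach its
    # peak); core 1 needs only 20%. Both numbers are made up for the example.
    core0 = toy_ipc(peak_ipc=2.0, demand=0.8)
    core1 = toy_ipc(peak_ipc=1.0, demand=0.2)
    fraction, ws = best_two_core_split(core0, core1, alone0=2.0, alone1=1.0)
    print(f"core 0 share = {fraction:.2f}, weighted speedup = {ws:.3f}")
```

A grid search is used here only because it is easy to check by hand. The abstract states that the paper derives the optimum partition sizes analytically.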

Supplementary Material

JPG File (metrics_1b_2.jpg)
MP4 File (metrics_1b_2.mp4)

      Published In

      SIGMETRICS '11: Proceedings of the ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
      June 2011
      376 pages
      ISBN:9781450308144
      DOI:10.1145/1993744

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. analytical model
      2. chip multiprocessors
      3. hardware prefetching
      4. memory bandwidth partitioning

      Qualifiers

      • Research-article

      Conference

      SIGMETRICS '11

      Acceptance Rates

      Overall Acceptance Rate 459 of 2,691 submissions, 17%

      Article Metrics

      • Downloads (last 12 months): 11
      • Downloads (last 6 weeks): 1
      Reflects downloads up to 06 Jan 2025

      Cited By

      • (2021) Satori. Proceedings of the 48th Annual International Symposium on Computer Architecture, pp. 292-305. DOI: 10.1109/ISCA52012.2021.00031. Online publication date: 14-Jun-2021
      • (2021) LIBRA: Clearing the Cloud Through Dynamic Memory Bandwidth Management. 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 815-826. DOI: 10.1109/HPCA51647.2021.00073. Online publication date: Feb-2021
      • (2019) CoPart. Proceedings of the Fourteenth EuroSys Conference 2019, pp. 1-16. DOI: 10.1145/3302424.3303963. Online publication date: 25-Mar-2019
      • (2019) Combining Prefetch Control and Cache Partitioning to Improve Multicore Performance. 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 953-962. DOI: 10.1109/IPDPS.2019.00103. Online publication date: May-2019
      • (2018) Hypart. Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, pp. 1-14. DOI: 10.1145/3243176.3243211. Online publication date: 1-Nov-2018
      • (2018) Efficient selective multicore prefetching under limited memory bandwidth. Journal of Parallel and Distributed Computing, 120:C, pp. 32-43. DOI: 10.1016/j.jpdc.2018.05.002. Online publication date: 1-Oct-2018
      • (2017) Band-Pass Prefetching. ACM Transactions on Architecture and Code Optimization, 14:2, pp. 1-27. DOI: 10.1145/3090635. Online publication date: 28-Jun-2017
      • (2017) Last Level Collective Hardware Prefetching For Data-Parallel Applications. 2017 IEEE 24th International Conference on High Performance Computing (HiPC), pp. 72-83. DOI: 10.1109/HiPC.2017.00018. Online publication date: Dec-2017
      • (2016) Characterizing and Optimizing the Performance of Multithreaded Programs Under Interference. Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, pp. 287-297. DOI: 10.1145/2967938.2967939. Online publication date: 11-Sep-2016
      • (2016) A Survey of Recent Prefetching Techniques for Processor Caches. ACM Computing Surveys, 49:2, pp. 1-35. DOI: 10.1145/2907071. Online publication date: 2-Aug-2016
