Studying the impact of hardware prefetching and bandwidth partitioning in chip-multiprocessors

Authors:

Yan Solihin

SIGMETRICS '11: Proceedings of the ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems

Pages 37 - 48

https://doi.org/10.1145/1993744.1993749

Published: 07 June 2011

Abstract

Modern high-performance microprocessors widely employ hardware prefetching to hide long memory access latency. While very useful, hardware prefetching tends to aggravate the bandwidth wall, a problem in which system performance is increasingly limited by the available off-chip pin bandwidth in Chip Multi-Processors (CMPs).

In this paper, we present an analytical model-based study of how hardware prefetching and memory bandwidth partitioning affect CMP system performance and how they interact. The model includes a composite prefetching metric that helps determine the conditions under which prefetching improves system performance, a bandwidth partitioning model that accounts for prefetching effects, and a derivation of the weighted-speedup-optimal bandwidth partition sizes for different cores. Through model-driven case studies, we make several observations that can be valuable for future CMP system design and optimization. We also validate the observations with simulation-based empirical evaluation and show that maximum system performance can be achieved by selective prefetching, guided by the composite prefetching metric, coupled with dynamic bandwidth partitioning.
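To make the notion of a weighted-speedup-optimal bandwidth partition concrete, here is a minimal Python sketch. It uses the standard definition of weighted speedup (the sum over cores of co-running IPC divided by stand-alone IPC). Everything else is an assumption made for illustration and is not the paper's analytical model: the `toy_ipc` bandwidth-cap model, the function and parameter names, and the brute-force two-core search are all hypothetical. The abstract does not state the actual model or the closed-form optimum.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Sum over cores of (IPC when co-running) / (IPC when running alone)."""
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("need one shared and one alone IPC per core")
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))


def toy_ipc(ipc_alone, bytes_per_instr, bandwidth):
    """Hypothetical model (an assumption, not from the paper): a core runs at
    its stand-alone IPC unless its bandwidth share (bytes/cycle) caps it."""
    return min(ipc_alone, bandwidth / bytes_per_instr)


def best_partition(ipc_alone, bytes_per_instr, total_bandwidth, steps=100):
    """Brute-force search over two-core splits of total_bandwidth.

    Returns (fraction given to core 0, weighted speedup at that split).
    """
    best = None
    for k in range(steps + 1):
        share = k / steps
        bw = (share * total_bandwidth, (1 - share) * total_bandwidth)
        ipc = [toy_ipc(a, b, w)
               for a, b, w in zip(ipc_alone, bytes_per_instr, bw)]
        ws = weighted_speedup(ipc, ipc_alone)
        if best is None or ws > best[1]:
            best = (share, ws)
    return best
```

For instance, `best_partition([1.0, 1.0], [1.0, 4.0], 2.0, steps=4)` compares five candidate splits between a core that needs 1 byte per instruction and one that needs 4. Under this toy model the search favors the core that converts bandwidth into instructions more efficiently, up to the point where that core is no longer bandwidth-bound.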

Index Terms

1. General and reference
  1. Cross-computing tools and techniques
    1. Performance
2. Hardware
  1. Hardware validation

Published In

SIGMETRICS '11: Proceedings of the ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems

June 2011

376 pages

ISBN: 9781450308144

DOI: 10.1145/1993744

General Chair: Arif Merchant (Google, USA)

Program Chairs: Kimberly Keeton (HP Labs, USA), Dan Rubenstein (Columbia University, USA)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMETRICS: ACM Special Interest Group on Measurement and Evaluation

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGMETRICS '11: ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems

Sponsor: SIGMETRICS

June 7 - 11, 2011

San Jose, California, USA

Acceptance Rates

Overall Acceptance Rate 459 of 2,691 submissions, 17%
