DOI: 10.1145/3470496.3527398
research-article

Register file prefetching

Published: 11 June 2022

Abstract

The memory wall continues to limit the performance of modern out-of-order (OOO) processors, despite the expensive provisioning of large multi-level caches and advances in memory prefetching. In this paper, we observe that the memory wall is not monolithic: it consists of many latency walls, one arising from the latency of each tier of the cache/memory hierarchy. Our results show that even though level-1 (L1) data cache latency is nearly 40X lower than main memory latency, mitigating it offers a performance opportunity very similar to that of the more widely studied main memory latency.
This motivates our proposal, Register File Prefetch (RFP), which uses the existing OOO scheduling pipeline and the available L1 data cache and register file bandwidth to successfully prefetch 43.4% of load requests from the L1 cache to the register file. Simulation results on 65 diverse workloads show that this translates to a 3.1% performance gain over a baseline with parameters similar to those of the Intel Tiger Lake processor, rising to 5.7% for a futuristic up-scaled core. We also contrast register file prefetching with techniques such as load value prediction and address prediction, which enhance performance by speculatively breaking data dependencies. Our analysis shows that RFP is synergistic with value prediction: the two features together deliver a 4.1% average performance improvement, significantly higher than the 2.2% gain obtained from value prediction alone.
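To give a feel for what a 43.4% prefetch coverage could mean for load latency, the following is a back-of-the-envelope sketch, not the paper's evaluation methodology. The 5-cycle L1 hit latency and the 0-cycle latency for a successfully prefetched load are illustrative assumptions, and the function name `average_load_latency` is hypothetical. An OOO core hides part of this latency, so the result does not translate directly into the reported speedups.

```python
def average_load_latency(coverage, l1_latency_cycles, prefetched_latency_cycles=0.0):
    """Expected load-to-use latency when a fraction `coverage` of loads
    already has its data in the register file by the time it is needed."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be in [0, 1]")
    return (coverage * prefetched_latency_cycles
            + (1.0 - coverage) * l1_latency_cycles)


RFP_COVERAGE = 0.434              # fraction of loads prefetched, from the abstract
ASSUMED_L1_LATENCY_CYCLES = 5.0   # assumption for illustration, not from the abstract

baseline = average_load_latency(0.0, ASSUMED_L1_LATENCY_CYCLES)
with_rfp = average_load_latency(RFP_COVERAGE, ASSUMED_L1_LATENCY_CYCLES)

print(f"baseline average L1-hit load latency: {baseline:.2f} cycles")
print(f"with 43.4% coverage:                  {with_rfp:.2f} cycles")
```

Under these assumptions, the average latency of an L1-hit load drops by the same fraction as the coverage. This is only a first-order view of the opportunity the abstract describes.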


Cited By

  • (2024) Data Prefetching on Processors with Heterogeneous Memory. Proceedings of the International Symposium on Memory Systems, 45-60. DOI: 10.1145/3695794.3695800. Online publication date: 30-Sep-2024
  • (2024) Constable: Improving Performance and Power Efficiency by Safely Eliminating Load Instruction Execution. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 88-102. DOI: 10.1109/ISCA59077.2024.00017. Online publication date: 29-Jun-2024
  • (2023) Performance Analysis of Criticality-Aware Out-of-Order Cores for Exploiting MLP. 2023 International Technical Conference on Circuits/Systems, Computers, and Communications (ITC-CSCC), 1-4. DOI: 10.1109/ITC-CSCC58803.2023.10212794. Online publication date: 25-Jun-2023


Published In

ISCA '22: Proceedings of the 49th Annual International Symposium on Computer Architecture
June 2022
1097 pages
ISBN: 9781450386104
DOI: 10.1145/3470496

Sponsors

In-Cooperation

  • IEEE CS TCAA: IEEE CS technical committee on architectural acoustics

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. address prediction
  2. load value prefetching
  3. microarchitecture
  4. pipeline prefetching
  5. value prediction

Qualifiers

  • Research-article

Conference

ISCA '22

Acceptance Rates

ISCA '22 Paper Acceptance Rate 67 of 400 submissions, 17%;
Overall Acceptance Rate 543 of 3,203 submissions, 17%

