PDIP: Priority Directed Instruction Prefetching

Authors:
Bhargav Reddy Godala
Computer Science, Princeton University, Princeton, New Jersey, United States of America
https://orcid.org/0009-0007-2739-0538

Sankara Prasad Ramesh
Electrical and Computer Engineering, University of California, San Diego, San Diego, California, USA
https://orcid.org/0000-0003-1361-8224

Gilles A. Pokam
Intel Corporation, Santa Clara, United States of America
https://orcid.org/0009-0002-4363-5383

Jared Stark
Intel Corporation, Hillsboro, Oregon, USA
https://orcid.org/0009-0002-4366-4723

Andre Seznec
Intel Corporation, Santa Clara, USA
https://orcid.org/0000-0002-3058-6503

Dean Tullsen
Computer Science and Engineering, University of California, San Diego, San Diego, California, USA
https://orcid.org/0000-0003-3174-9316

David I. August
Computer Science, Princeton University, Princeton, New Jersey, USA
https://orcid.org/0000-0003-3327-1803

ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, April 2024, Pages 846–861. https://doi.org/10.1145/3620665.3640394

Published: 27 April 2024

ABSTRACT

Modern server workloads have large code footprints that are prone to front-end bottlenecks caused by instruction cache capacity misses. Even with aggressive fetch directed instruction prefetching (FDIP), which is implemented in modern processors, significant front-end stalls due to I-Cache misses remain. A major portion of the misses that occur on a path predicted by the branch prediction unit (BPU) are tolerated by FDIP without causing stalls. Prior work on instruction prefetching, however, was not designed to work with FDIP processors: its singular goal is to reduce I-Cache misses, whereas FDIP processors are designed to tolerate them. Designing an instruction prefetcher that works in conjunction with FDIP requires identifying the fraction of cache misses that impact front-end performance (those not fully hidden by FDIP) and targeting only them.

In this paper, we propose Priority Directed Instruction Prefetching (PDIP), a novel instruction prefetching technique that complements FDIP by issuing prefetches only for targets where FDIP struggles: those along the resteer path of front-end stall-causing events. PDIP identifies these targets and associates them with a trigger for future prefetching. With a 43.5 KB budget, PDIP achieves up to a 5.1% IPC speedup on important workloads such as cassandra and a geomean IPC speedup of 3.2% across 16 benchmarks.
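
The abstract says only that PDIP identifies stall-causing targets and associates them with a trigger for later prefetching; it does not describe the hardware structure. The following is a toy software sketch of that general idea, not the paper's design. The class name `TriggerTargetTable`, the table capacity, the per-trigger target limit, and the LRU eviction policy are all assumptions made for illustration.

```python
from collections import OrderedDict


class TriggerTargetTable:
    """Hypothetical trigger -> prefetch-target table with LRU eviction.

    This is an illustrative model only; sizes and policies are assumptions.
    """

    def __init__(self, capacity=4, max_targets=2):
        self.capacity = capacity        # max number of triggers tracked
        self.max_targets = max_targets  # max target lines kept per trigger
        self.entries = OrderedDict()    # trigger address -> list of target lines

    def record_stall(self, trigger, target_line):
        """Associate a stall-causing miss line with the trigger seen before it."""
        targets = self.entries.pop(trigger, [])
        if target_line in targets:
            targets.remove(target_line)
        targets.append(target_line)
        # Keep only the most recently recorded targets for this trigger.
        self.entries[trigger] = targets[-self.max_targets:]
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used trigger

    def lookup(self, trigger):
        """Return the lines to prefetch when the trigger is observed again."""
        if trigger not in self.entries:
            return []
        self.entries.move_to_end(trigger)  # mark trigger as recently used
        return list(self.entries[trigger])


if __name__ == "__main__":
    table = TriggerTargetTable(capacity=2, max_targets=2)
    table.record_stall(trigger=0x400, target_line=0x9000)
    table.record_stall(trigger=0x400, target_line=0x9040)
    print([hex(line) for line in table.lookup(0x400)])
```

In this sketch, a prefetch is issued only for lines that previously caused a stall, which mirrors the abstract's point that the prefetcher should target only the misses FDIP does not hide rather than all I-Cache misses.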

Published in
ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2
April 2024
1299 pages
ISBN: 9798400703850
DOI: 10.1145/3620665
General Chairs: Nael Abu-Ghazaleh, Rajiv Gupta
Program Chairs: Madan Musuvathi, Dan Tsafrir
This work is licensed under a Creative Commons Attribution International 4.0 License.
Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates
Overall Acceptance Rate: 535 of 2,713 submissions, 20%