ABSTRACT
Modern server workloads have large code footprints and are prone to front-end bottlenecks caused by instruction cache (I-Cache) capacity misses. Even with aggressive fetch-directed instruction prefetching (FDIP), as implemented in modern processors, significant front-end stalls due to I-Cache misses remain. FDIP tolerates a major portion of the misses that occur on a path predicted by the branch prediction unit (BPU) without stalling. Prior work on instruction prefetching, however, was not designed to work with FDIP processors: its singular goal is reducing I-Cache misses, whereas FDIP processors are designed to tolerate them. Designing an instruction prefetcher that works in conjunction with FDIP requires identifying the fraction of cache misses that impact front-end performance (those not fully hidden by FDIP) and targeting only them.
In this paper, we propose Priority Directed Instruction Prefetching (PDIP), a novel instruction prefetching technique that complements FDIP by issuing prefetches only for targets where FDIP struggles: those along the resteer path of front-end stall-causing events. PDIP identifies these targets and associates each with a trigger that initiates a future prefetch. At a 43.5KB budget, PDIP achieves up to 5.1% IPC speedup on important workloads such as cassandra and a geomean IPC speedup of 3.2% across 16 benchmarks.
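The association between a trigger and its prefetch targets can be pictured with a small software model. The following is only an illustrative sketch, not the hardware design evaluated in the paper: the class name `TriggerTable`, the LRU replacement, the 64-byte line size, the table capacity, and the use of a plain instruction address as the trigger are all assumptions made for the example.

```python
from collections import OrderedDict

LINE_BYTES = 64  # assumed cache-line size, not taken from the paper


class TriggerTable:
    """Toy trigger -> prefetch-target table with LRU eviction (hypothetical)."""

    def __init__(self, capacity=4, targets_per_entry=2):
        self.capacity = capacity
        self.targets_per_entry = targets_per_entry
        self.entries = OrderedDict()  # trigger address -> list of line addresses

    def record_stall(self, trigger, miss_addr):
        """Associate a stall-causing miss on a resteer path with its trigger."""
        line = miss_addr // LINE_BYTES * LINE_BYTES  # align to the cache line
        targets = self.entries.pop(trigger, [])
        if line not in targets:
            # Keep only the most recent targets for this trigger.
            targets = (targets + [line])[-self.targets_per_entry:]
        self.entries[trigger] = targets  # re-insert as most recently used
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used

    def prefetches_for(self, trigger):
        """Return the lines to prefetch when the trigger is seen again."""
        targets = self.entries.get(trigger)
        if targets is None:
            return []
        self.entries.move_to_end(trigger)  # refresh LRU position
        return list(targets)


table = TriggerTable()
table.record_stall(trigger=0x4000, miss_addr=0x9F48)
print([hex(a) for a in table.prefetches_for(0x4000)])  # prints ['0x9f40']
```

In this model, only misses that were observed to stall the front end are recorded, so misses that FDIP already hides never consume table space; how PDIP actually selects triggers and prioritizes targets is described in the paper itself.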