DOI: 10.1145/3620665.3640394 (ASPLOS conference proceedings)

PDIP: Priority Directed Instruction Prefetching

Published: 27 April 2024

ABSTRACT

Modern server workloads have large code footprints, which make them prone to front-end bottlenecks caused by instruction cache capacity misses. Even with aggressive fetch directed instruction prefetching (FDIP), as implemented in modern processors, significant front-end stalls due to I-Cache misses remain. A major portion of the misses that occur on a path predicted by the branch prediction unit (BPU) are tolerated by FDIP without causing stalls. Prior work on instruction prefetching, however, was not designed to work with FDIP processors: its singular goal is to reduce I-Cache misses, whereas FDIP processors are designed to tolerate them. Designing an instruction prefetcher that works in conjunction with FDIP requires identifying the fraction of cache misses that impact front-end performance (those not fully hidden by FDIP) and targeting only them.
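To make the distinction between hidden and stall-causing misses concrete, the following minimal sketch filters a list of miss records down to those with exposed latency. It is an illustration under stated assumptions, not the paper's mechanism: the `Miss` record, its fields, and the rule "exposed stall = fill latency minus prefetch lead time" are hypothetical simplifications, and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Miss:
    """Hypothetical record of one I-Cache miss (not taken from the paper)."""
    line_addr: int      # cache-line address that missed
    fill_latency: int   # cycles needed to bring the line in
    prefetch_lead: int  # cycles between the FDIP prefetch and the demand fetch


def exposed_stall(miss: Miss) -> int:
    """Cycles of miss latency that the prefetch lead time did not cover.

    Assumption: a miss is fully hidden when the lead time is at least
    as long as the fill latency.
    """
    return max(0, miss.fill_latency - miss.prefetch_lead)


def stall_causing(misses):
    """Keep only the misses that still stall the front end."""
    return [m for m in misses if exposed_stall(m) > 0]


# Illustrative values only.
misses = [
    Miss(0x1000, fill_latency=14, prefetch_lead=20),  # fully hidden
    Miss(0x2040, fill_latency=14, prefetch_lead=0),   # no lead time at all
    Miss(0x3080, fill_latency=40, prefetch_lead=25),  # partially hidden
]
targets = stall_causing(misses)
fraction = len(targets) / len(misses)
```

In this toy model, a prefetcher that complements FDIP would consider only the entries in `targets` and ignore the first miss, which FDIP already hides.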

In this paper, we propose Priority Directed Instruction Prefetching (PDIP), a novel instruction prefetching technique that complements FDIP by issuing prefetches only for targets where FDIP struggles: those along the resteer path of front-end stall-causing events. PDIP identifies these targets and associates them with a trigger for a future prefetch. At a 43.5KB budget, PDIP achieves up to 5.1% IPC speedup on important workloads such as cassandra and a geomean IPC speedup of 3.2% across 16 benchmarks.
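The idea of associating stall-causing targets with a trigger can be pictured as a small lookup table. The sketch below is a toy model only: the abstract does not describe PDIP's table organization, so the `TriggerTable` class, its LRU eviction, its capacity parameters, and the use of plain addresses as triggers are all assumptions made for illustration.

```python
from collections import OrderedDict


class TriggerTable:
    """Toy trigger -> prefetch-target table (hypothetical, not PDIP's design)."""

    def __init__(self, capacity=4, max_targets=2):
        self.capacity = capacity        # assumed number of trigger entries
        self.max_targets = max_targets  # assumed targets kept per trigger
        self.entries = OrderedDict()    # trigger -> list of target line addresses

    def train(self, trigger, target_line):
        """Record that target_line caused a stall some time after trigger."""
        targets = self.entries.pop(trigger, [])
        if target_line not in targets:
            targets.append(target_line)
            targets = targets[-self.max_targets:]  # keep the newest targets
        self.entries[trigger] = targets            # reinsert as most recent
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)       # evict least recently used

    def lookup(self, trigger):
        """Return the lines to prefetch when trigger is seen again."""
        targets = self.entries.get(trigger)
        if targets is None:
            return []
        self.entries.move_to_end(trigger)
        return list(targets)


table = TriggerTable()
table.train(0x400, 0x9000)  # stall on line 0x9000 after trigger 0x400
table.train(0x400, 0x9040)  # a second stall target for the same trigger
to_prefetch = table.lookup(0x400)
unknown = table.lookup(0x500)
```

Here `to_prefetch` holds both recorded lines and `unknown` is empty, so no prefetch is issued for a trigger that was never trained. A real design would also need to decide what event serves as the trigger and how far ahead of the target it must occur, which this sketch leaves open.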

