DOI: 10.1145/3466752.3480046
research-article
Public Access

PDede: Partitioned, Deduplicated, Delta Branch Target Buffer

Published: 17 October 2021

ABSTRACT

Due to large instruction footprints, contemporary data center applications suffer from frequent frontend stalls. Despite being a significant contributor to these stalls, the Branch Target Buffer (BTB) has received less attention than other frontend structures such as the instruction cache. While prior works have looked at enhancing the BTB through more efficient replacement and prefetching policies, a thorough analysis of how to optimize the BTB’s storage efficiency is missing. In this work, we analyze BTB accesses for a large number (100+) of frontend-bound applications to understand their branch target characteristics. This analysis provides three significant observations about the nature of branch targets: (1) a significant number of branch instructions have the same branch target, (2) a significant number of branch targets share the same page address, and (3) a significant percentage of branch instructions and their targets are located on the same page. Furthermore, we observe that while applications’ address spaces are sparsely populated, they exhibit spatial locality within and across pages. We refer to these multi-page addresses as regions and we show that applications traverse a significantly smaller number of regions than pages. Based on these insights, we propose PDede, an efficient re-design of the BTB micro-architecture that improves storage efficiency by removing redundancy among branches and their targets. PDede introduces three techniques to reduce BTB-miss-induced frontend stalls: (a) BTB Partitioning, (b) Branch Target Deduplication, and (c) Delta Branch Target Encoding. We evaluate PDede across 100+ applications, spanning several usage scenarios, and show that it provides an average 14.4% (up to 76%) IPC speedup by reducing BTB misses by 54.7% on average (and up to 99.8%).
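
The three observations about branch targets can be made concrete with a small measurement script. The sketch below is illustrative only and is not the analysis tooling used in the paper: the function names, the toy trace, and the 4 KiB page size are all assumptions introduced here. It computes, for a list of (branch PC, target) pairs, the fraction of branches that share a target with another branch, the fraction of distinct targets that share a page with another target, and the fraction of branches located on the same page as their target.

```python
from collections import Counter

PAGE_BITS = 12  # assumption: 4 KiB pages; the abstract does not state a page size


def page_of(addr: int) -> int:
    """Return the page number of an address (the bits above the page offset)."""
    return addr >> PAGE_BITS


def redundancy_stats(branches: list[tuple[int, int]]) -> dict[str, float]:
    """Compute three fractions over (branch_pc, target) pairs.

    shared_target:      branches whose target is also the target of another branch
    shared_target_page: distinct targets whose page holds another distinct target
    same_page:          branches located on the same page as their target
    """
    target_counts = Counter(target for _, target in branches)
    distinct_targets = set(target_counts)
    page_counts = Counter(page_of(t) for t in distinct_targets)

    shared_target = sum(1 for _, t in branches if target_counts[t] > 1)
    shared_page = sum(1 for t in distinct_targets if page_counts[page_of(t)] > 1)
    same_page = sum(1 for pc, t in branches if page_of(pc) == page_of(t))

    return {
        "shared_target": shared_target / len(branches),
        "shared_target_page": shared_page / len(distinct_targets),
        "same_page": same_page / len(branches),
    }


# Hypothetical toy trace of (branch PC, taken target) pairs, invented for illustration.
trace = [
    (0x401010, 0x401080),        # branch and target on the same page
    (0x401200, 0x401080),        # same target as the first branch
    (0x402004, 0x7F0000001000),  # far target on a distant page
    (0x403008, 0x7F0000001040),  # target shares a page with the previous target
]
stats = redundancy_stats(trace)
print(stats)
```

On a real branch trace, high values for these three fractions would indicate the kind of redundancy that target deduplication, page-level partitioning, and delta encoding are meant to exploit; how PDede actually organizes its tables is described in the full paper, not in this sketch.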


  • Published in

    MICRO '21: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture
    October 2021
    1322 pages
    ISBN: 9781450385572
    DOI: 10.1145/3466752

    Copyright © 2021 ACM


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Qualifiers

    • research-article
    • Research
    • Refereed limited

    Acceptance Rates

    Overall Acceptance Rate: 484 of 2,242 submissions, 22%
