DOI: 10.1145/3466752.3480046
research-article
Public Access

PDede: Partitioned, Deduplicated, Delta Branch Target Buffer

Published: 17 October 2021

Abstract

Due to large instruction footprints, contemporary data center applications suffer from frequent frontend stalls. Despite being a significant contributor to these stalls, the Branch Target Buffer (BTB) has received less attention than other frontend structures such as the instruction cache. While prior works have looked at enhancing the BTB through more efficient replacement and prefetching policies, a thorough analysis of how to optimize the BTB’s storage efficiency is missing. In this work, we analyze BTB accesses for a large number (100+) of frontend-bound applications to understand their branch target characteristics. This analysis provides three significant observations about the nature of branch targets: (1) a significant number of branch instructions have the same branch target, (2) a significant number of branch targets share the same page address, and (3) a significant percentage of branch instructions and their targets are located on the same page. Furthermore, we observe that while applications’ address spaces are sparsely populated, they exhibit spatial locality within and across pages. We refer to these multi-page addresses as regions, and we show that applications traverse a significantly smaller number of regions than pages. Based on these insights, we propose PDede, an efficient redesign of the BTB microarchitecture that improves storage efficiency by removing redundancy among branches and their targets. PDede introduces three techniques to reduce BTB-miss-induced frontend stalls: (a) BTB Partitioning, (b) Branch Target Deduplication, and (c) Delta Branch Target Encoding. We evaluate PDede across 100+ applications, spanning several usage scenarios, and show that it provides an average 14.4% (up to 76%) IPC speedup by reducing BTB misses by 54.7% on average (and up to 99.8%).
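
The three observations in the abstract can be made concrete with a small sketch. The code below is not PDede's hardware design: it is a toy Python illustration that measures the three kinds of redundancy on a hand-made list of (branch PC, target) pairs, plus one simple way a same-page target could be stored as an offset. The 4 KiB page size, the function names, and the toy addresses are all assumptions made for illustration.

```python
from collections import Counter

PAGE_BITS = 12  # assumption: 4 KiB pages; the abstract does not state a page size


def page(addr: int) -> int:
    """Return the page number of an address."""
    return addr >> PAGE_BITS


def redundancy_stats(branches):
    """Fraction of branches matching each of the three observations.

    branches: list of (branch_pc, target) pairs.
    """
    n = len(branches)
    target_counts = Counter(t for _, t in branches)
    target_page_counts = Counter(page(t) for _, t in branches)
    return {
        # (1) branches whose target is also the target of another branch
        "duplicate_target": sum(target_counts[t] > 1 for _, t in branches) / n,
        # (2) branches whose target page is shared with another branch's target
        "shared_target_page": sum(
            target_page_counts[page(t)] > 1 for _, t in branches
        ) / n,
        # (3) branches located on the same page as their own target
        "same_page": sum(page(pc) == page(t) for pc, t in branches) / n,
    }


def delta_encode(pc: int, target: int):
    """Hypothetical encoding: keep only the page offset for same-page targets."""
    if page(pc) == page(target):
        return ("delta", target & ((1 << PAGE_BITS) - 1))
    return ("full", target)


def delta_decode(pc: int, entry):
    """Rebuild the full target from the branch PC and the stored entry."""
    kind, value = entry
    if kind == "delta":
        return (page(pc) << PAGE_BITS) | value
    return value


# Toy trace (made-up addresses, not data from the paper).
TOY_BRANCHES = [
    (0x401000, 0x401080),  # same page as its target
    (0x401010, 0x401080),  # same page, duplicate target
    (0x402000, 0x7F0010),  # target on a different page
    (0x403000, 0x7F0020),  # different page, shares a target page with the entry above
]

if __name__ == "__main__":
    print(redundancy_stats(TOY_BRANCHES))
    for pc, target in TOY_BRANCHES:
        print(hex(pc), delta_encode(pc, target))
```

In this toy trace, two of the four branches can store a 12-bit offset instead of a full target, which is the kind of storage saving the abstract attributes to removing redundancy among branches and their targets.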


Published In

MICRO '21: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture
October 2021
1322 pages
ISBN: 9781450385572
DOI: 10.1145/3466752

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Branch Target Buffer
  2. Performance
  3. Superscalar cores

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • NSF

Conference

MICRO '21

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Cited By

  • (2025) gCom: Fine-grained Compressors in Graphics Memory of Mobile GPU. ACM Transactions on Architecture and Code Optimization. DOI: 10.1145/3711819. Online publication date: 8-Jan-2025
  • (2024) Weeding out Front-End Stalls with Uneven Block Size Instruction Cache. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 1382-1396. DOI: 10.1109/MICRO61859.2024.00102. Online publication date: 2-Nov-2024
  • (2024) Alternate Path Fetch. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 1217-1229. DOI: 10.1109/ISCA59077.2024.00091. Online publication date: 29-Jun-2024
  • (2024) UDP: Utility-Driven Fetch Directed Instruction Prefetching. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 1188-1201. DOI: 10.1109/ISCA59077.2024.00089. Online publication date: 29-Jun-2024
  • (2024) AVM-BTB: Adaptive and Virtualized Multi-level Branch Target Buffer. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 17-31. DOI: 10.1109/ISCA59077.2024.00012. Online publication date: 29-Jun-2024
  • (2023) Protean: Resource-efficient Instruction Prefetching. Proceedings of the International Symposium on Memory Systems, 1-13. DOI: 10.1145/3631882.3631904. Online publication date: 2-Oct-2023
  • (2023) Branch Target Buffer Organizations. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 240-253. DOI: 10.1145/3613424.3623774. Online publication date: 28-Oct-2023
  • (2023) Warming Up a Cold Front-End with Ignite. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 254-267. DOI: 10.1145/3613424.3614258. Online publication date: 28-Oct-2023
  • (2023) Wrong-Path-Aware Entangling Instruction Prefetcher. IEEE Transactions on Computers 73:2, 548-559. DOI: 10.1109/TC.2023.3337308. Online publication date: 1-Dec-2023
  • (2023) A Storage-Effective BTB Organization for Servers. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 1153-1167. DOI: 10.1109/HPCA56546.2023.10070938. Online publication date: Feb-2023
