PDede: Partitioned, Deduplicated, Delta Branch Target Buffer

Authors:

Niranjan K Soundararajan,

Tanvir Ahmed Khan,

Sreenivas Subramoney

MICRO '21: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture

Pages 779 - 791

https://doi.org/10.1145/3466752.3480046

Published: 17 October 2021

Abstract

Due to large instruction footprints, contemporary data center applications suffer from frequent frontend stalls. Despite being a significant contributor to these stalls, the Branch Target Buffer (BTB) has received less attention than other frontend structures such as the instruction cache. While prior works have looked at enhancing the BTB through more efficient replacement and prefetching policies, a thorough analysis of how to optimize the BTB’s storage efficiency is missing. In this work, we analyze BTB accesses for a large number (100+) of frontend-bound applications to understand their branch target characteristics. This analysis provides three significant observations about the nature of branch targets: (1) a significant number of branch instructions have the same branch target, (2) a significant number of branch targets share the same page address, and (3) a significant percentage of branch instructions and their targets are located on the same page. Furthermore, we observe that while applications’ address spaces are sparsely populated, they exhibit spatial locality within and across pages. We refer to these multi-page addresses as regions, and we show that applications traverse a significantly smaller number of regions than pages. Based on these insights, we propose PDede, an efficient redesign of the BTB microarchitecture that improves storage efficiency by removing redundancy among branches and their targets. PDede introduces three techniques to reduce BTB-miss-induced frontend stalls: (a) BTB Partitioning, (b) Branch Target Deduplication, and (c) Delta Branch Target Encoding. We evaluate PDede across 100+ applications, spanning several usage scenarios, and show that it provides an average 14.4% (up to 76%) IPC speedup by reducing BTB misses by 54.7% on average (and up to 99.8%).
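The three observations in the abstract can be made concrete with a small software model. The sketch below is illustrative only and is not the PDede hardware design: it measures the three kinds of redundancy on a list of (branch PC, target) pairs, then applies a toy encoding in which same-page targets keep only their page offset and all other targets share a deduplicated list of page numbers. The 4 KiB page size, the function names, and the sample trace are all assumptions made for this example.

```python
from collections import Counter

PAGE_BITS = 12  # assumption: 4 KiB pages; the abstract does not state a page size


def page_of(addr: int) -> int:
    """Page number of an address."""
    return addr >> PAGE_BITS


def offset_of(addr: int) -> int:
    """Offset of an address within its page."""
    return addr & ((1 << PAGE_BITS) - 1)


def redundancy_stats(branches):
    """Measure the three kinds of redundancy on (branch_pc, target) pairs."""
    target_counts = Counter(t for _, t in branches)
    distinct_targets = set(target_counts)
    page_counts = Counter(page_of(t) for t in distinct_targets)
    return {
        # (1) fraction of branches whose target is shared with another branch
        "shared_target": sum(1 for _, t in branches if target_counts[t] > 1)
        / len(branches),
        # (2) fraction of distinct targets whose page holds another distinct target
        "shared_target_page": sum(
            1 for t in distinct_targets if page_counts[page_of(t)] > 1
        )
        / len(distinct_targets),
        # (3) fraction of branches located on the same page as their target
        "same_page": sum(1 for pc, t in branches if page_of(pc) == page_of(t))
        / len(branches),
    }


def encode(branches):
    """Toy encoding (hypothetical, for illustration).

    Same-page targets store only the page offset; other targets store an
    offset plus an index into a deduplicated list of target page numbers.
    """
    pages = []       # deduplicated target page numbers
    page_index = {}  # page number -> position in `pages`
    entries = {}     # branch_pc -> ("delta", offset) or ("page", index, offset)
    for pc, target in branches:
        if page_of(pc) == page_of(target):
            entries[pc] = ("delta", offset_of(target))
            continue
        page = page_of(target)
        if page not in page_index:
            page_index[page] = len(pages)
            pages.append(page)
        entries[pc] = ("page", page_index[page], offset_of(target))
    return pages, entries


def decode(pc, entry, pages):
    """Rebuild the full target address from a toy entry."""
    if entry[0] == "delta":
        return (page_of(pc) << PAGE_BITS) | entry[1]
    _, index, offset = entry
    return (pages[index] << PAGE_BITS) | offset


# Made-up trace: one same-page branch and three branches into a distant page.
trace = [
    (0x401000, 0x401080),
    (0x401010, 0x7F0000001230),
    (0x402020, 0x7F0000001230),
    (0x403030, 0x7F0000001400),
]
stats = redundancy_stats(trace)
pages, entries = encode(trace)
```

In this toy trace the three distant targets need only one stored page number, and the same-page branch needs none, which is the kind of storage saving the abstract attributes to removing redundancy among branches and their targets.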

Published In

MICRO '21: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture

October 2021

1322 pages

ISBN: 9781450385572

DOI: 10.1145/3466752

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

NSF

Conference

MICRO '21

Sponsor: SIGMICRO

MICRO '21: 54th Annual IEEE/ACM International Symposium on Microarchitecture

October 18 - 22, 2021

Virtual Event, Greece

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%
