
Thermometer: Profile-Guided BTB Replacement for Data Center Applications

Published: 11 June 2022

Abstract

Modern processors employ a decoupled frontend with Fetch Directed Instruction Prefetching (FDIP) to avoid frontend stalls in data center applications. However, the large branch footprint of data center applications precipitates frequent Branch Target Buffer (BTB) misses that prohibit FDIP from eliminating more than 40% of all frontend stalls. We find that the state-of-the-art BTB optimization techniques (e.g., BTB prefetching and replacement mechanisms) cannot eliminate these misses due to their inadequate understanding of branch reuse behavior in data center applications.
In this paper, we first perform a comprehensive characterization of the branch behavior of data center applications, and determine that identifying optimal BTB replacement decisions requires considering both transient and holistic (i.e., across the entire execution) branch behavior. We then present Thermometer, a novel BTB replacement technique that realizes the holistic branch behavior via a profile-guided analysis. Based on the collected profile, Thermometer generates useful BTB replacement hints that the underlying hardware can leverage. We evaluate Thermometer using 13 widely used data center applications and demonstrate that it provides an average speedup of 8.7% (0.4%-64.9%) while outperforming state-of-the-art BTB replacement techniques by 5.6× (on average, the best-performing prior work achieves a 1.5% speedup). We also demonstrate that Thermometer achieves a speedup that is, on average, 83.6% of the speedup achieved by the optimal BTB replacement policy.
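To make the idea of profile-guided replacement hints concrete, the following is a minimal sketch, not the paper's implementation. The abstract does not specify how hints are computed or encoded, so the sketch assumes the following: the BTB is modeled as a small fully associative buffer, the hint for a branch is its hit ratio under Belady's optimal policy on a profiling trace, and the hinted policy evicts the entry with the lowest hint, breaking ties by recency. The names `belady_profile`, `simulate`, and `make_trace`, and the synthetic trace itself, are hypothetical.

```python
from collections import OrderedDict, defaultdict

INF = float("inf")


def belady_profile(trace, capacity):
    """Replay a branch trace under Belady's optimal policy and return
    each branch's hit ratio (hits / accesses) as a replacement hint."""
    # next_use[i] is the index of the next access to trace[i], or INF.
    next_use = [INF] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], INF)
        last_seen[trace[i]] = i

    resident = {}  # branch PC -> index of its next use
    hits = defaultdict(int)
    accesses = defaultdict(int)
    for i, pc in enumerate(trace):
        accesses[pc] += 1
        if pc in resident:
            hits[pc] += 1
        elif len(resident) >= capacity:
            # Evict the entry whose next use lies farthest in the future.
            victim = max(resident, key=resident.get)
            del resident[victim]
        resident[pc] = next_use[i]
    return {pc: hits[pc] / accesses[pc] for pc in accesses}


def simulate(trace, capacity, hints=None):
    """Count BTB hits. Without hints the policy is plain LRU; with hints
    the victim is the entry with the lowest hint (LRU breaks ties)."""
    btb = OrderedDict()  # ordered from least to most recently used
    hits = 0
    for pc in trace:
        if pc in btb:
            hits += 1
            btb.move_to_end(pc)
            continue
        if len(btb) >= capacity:
            if hints is None:
                btb.popitem(last=False)
            else:
                # Branches absent from the profile are treated as cold.
                victim = min(btb, key=lambda p: hints.get(p, 0.0))
                del btb[victim]
        btb[pc] = None
    return hits


def make_trace(rounds, cold_base):
    """Two recurring branches separated by two never-reused branches."""
    trace = []
    for r in range(rounds):
        trace += [0xA0, 0xB0, cold_base + 2 * r, cold_base + 2 * r + 1]
    return trace


profile_trace = make_trace(4, 0x1000)  # stands in for a profiling run
eval_trace = make_trace(4, 0x2000)     # later run with different cold PCs
hints = belady_profile(profile_trace, capacity=3)
lru_hits = simulate(eval_trace, capacity=3)
hinted_hits = simulate(eval_trace, capacity=3, hints=hints)
print(f"LRU hits: {lru_hits}, hinted hits: {hinted_hits}")
```

On this synthetic trace, LRU lets the stream of one-time branches keep displacing the two recurring ones, whereas the hinted policy retains them because their profiled hit ratio is high. Whole-execution information of this kind is not visible to a purely recency-based policy, which is the intuition behind combining transient and holistic behavior.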


    Published In

    ISCA '22: Proceedings of the 49th Annual International Symposium on Computer Architecture
    June 2022
    1097 pages
    ISBN:9781450386104
    DOI:10.1145/3470496

    In-Cooperation

    • IEEE CS TCCA: IEEE CS Technical Committee on Computer Architecture

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. branch target buffer
    2. cache replacement
    3. data center
    4. frontend stalls

    Qualifiers

    • Research-article

    Funding Sources

    • DARPA
    • Intel Labs
    • Applications Driving Architectures (ADA) Research Center
    • SRC
    • NSF

    Conference

    ISCA '22

    Acceptance Rates

    ISCA '22 Paper Acceptance Rate 67 of 400 submissions, 17%;
    Overall Acceptance Rate 543 of 3,203 submissions, 17%
