Thermometer: Profile-Guided BTB Replacement for Data Center Applications

Authors:

Tanvir Ahmed Khan,

Sara Mahdizadeh Shahri,

Akshitha Sriraman,

Niranjan K Soundararajan,

Sreenivas Subramoney,

Daniel A. Jiménez,

Baris Kasikci

ISCA '22: Proceedings of the 49th Annual International Symposium on Computer Architecture

Pages 742 - 756

https://doi.org/10.1145/3470496.3527430

Published: 11 June 2022

Abstract

Modern processors employ a decoupled frontend with Fetch Directed Instruction Prefetching (FDIP) to avoid frontend stalls in data center applications. However, the large branch footprint of data center applications causes frequent Branch Target Buffer (BTB) misses that prevent FDIP from eliminating more than 40% of all frontend stalls. We find that state-of-the-art BTB optimization techniques (e.g., BTB prefetching and replacement mechanisms) cannot eliminate these misses because they do not adequately capture branch reuse behavior in data center applications.

In this paper, we first perform a comprehensive characterization of the branch behavior of data center applications, and determine that identifying optimal BTB replacement decisions requires considering both transient and holistic (i.e., across the entire execution) branch behavior. We then present Thermometer, a novel BTB replacement technique that realizes the holistic branch behavior via a profile-guided analysis. Based on the collected profile, Thermometer generates useful BTB replacement hints that the underlying hardware can leverage. We evaluate Thermometer using 13 widely used data center applications and demonstrate that it provides an average speedup of 8.7% (0.4%-64.9%) while outperforming state-of-the-art BTB replacement techniques by 5.6× (on average, the best-performing prior work achieves a 1.5% speedup). We also demonstrate that Thermometer achieves a performance speedup that is, on average, 83.6% of the speedup achieved by the optimal BTB replacement policy.
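The general idea of profile-guided replacement hints can be illustrated with a toy model. The sketch below is not the paper's algorithm: it assumes a small fully associative BTB, uses a simple execution count from a profile as the per-branch hint, and profiles the same trace it later replays. The function names (`profile_hints`, `simulate_lru`, `simulate_hinted`, `toy_trace`) and the trace shape are hypothetical and chosen only for illustration.

```python
from collections import Counter, OrderedDict


def profile_hints(trace):
    """Offline pass: count how often each branch PC appears in the profile.

    Assumption: a plain execution count stands in for a replacement hint.
    """
    return Counter(trace)


def simulate_lru(trace, capacity):
    """Fully associative BTB with LRU replacement; returns the miss count."""
    btb = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    for pc in trace:
        if pc in btb:
            btb.move_to_end(pc)          # refresh recency on a hit
        else:
            misses += 1
            if len(btb) >= capacity:
                btb.popitem(last=False)  # evict the least recently used entry
            btb[pc] = True
    return misses


def simulate_hinted(trace, capacity, hints):
    """Evict the resident entry with the lowest hint; ties fall back to LRU."""
    btb = OrderedDict()
    misses = 0
    for pc in trace:
        if pc in btb:
            btb.move_to_end(pc)
        else:
            misses += 1
            if len(btb) >= capacity:
                # min() scans oldest-first, so equal hints evict the LRU entry.
                victim = min(btb, key=lambda p: hints.get(p, 0))
                del btb[victim]
            btb[pc] = True
    return misses


def toy_trace(rounds=5):
    """Two frequently reused branches interleaved with one-shot branches."""
    trace = []
    for i in range(rounds):
        trace += [0x10, 0x20, 0x100 + 2 * i, 0x101 + 2 * i]
    return trace


if __name__ == "__main__":
    trace = toy_trace()
    hints = profile_hints(trace)
    print("LRU misses:   ", simulate_lru(trace, capacity=3))
    print("Hinted misses:", simulate_hinted(trace, capacity=3, hints=hints))
```

On this toy trace, recency alone keeps evicting the two reused branches whenever one-shot branches stream through, while the whole-execution hint protects them. A real design would have to encode hints compactly and obtain them from a separate profiling run.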

Index Terms

Thermometer: profile-guided btb replacement for data center applications
1. Computer systems organization
  1. Architectures
    1. Serial architectures
      1. Pipeline computing

Published In

ISCA '22: Proceedings of the 49th Annual International Symposium on Computer Architecture

June 2022

1097 pages

ISBN:9781450386104

DOI:10.1145/3470496

General Chairs: Valentina Salapura (Google), Mohamed Zahran (New York University)

Program Chairs: Fred Chong (The University of Chicago), Lingjia Tang (The University of Michigan)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

In-Cooperation

IEEE CS TCCA: IEEE CS Technical Committee on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 June 2022

Qualifiers

Research-article

Funding Sources

DARPA
Intel Labs
Applications Driving Architectures (ADA) Research Center
SRC
NSF

Conference

ISCA '22

Sponsor:

SIGARCH

ISCA '22: The 49th Annual International Symposium on Computer Architecture

June 18 - 22, 2022

New York, New York

Acceptance Rates

ISCA '22 Paper Acceptance Rate 67 of 400 submissions, 17%;

Overall Acceptance Rate 543 of 3,203 submissions, 17%

Citations

Cited By

Frenot L, Pereira F, Rodríguez G, Sadayappan P, Sukumaran-Rajam A (2024). Reducing the Overhead of Exact Profiling by Reusing Affine Variables. Proceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction, pp. 150-161. Online publication date: 17-Feb-2024.
https://dl.acm.org/doi/10.1145/3640537.3641569
Brunner R, Kumar R (2024). Weeding out Front-End Stalls with Uneven Block Size Instruction Cache. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1382-1396. Online publication date: 2-Nov-2024.
https://doi.org/10.1109/MICRO61859.2024.00102
Liu Y, Li X, Zhang T, Liu T, Guo Q, Zhang F, Wang J (2024). AVM-BTB: Adaptive and Virtualized Multi-level Branch Target Buffer. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 17-31. Online publication date: 29-Jun-2024.
https://doi.org/10.1109/ISCA59077.2024.00012
Perais A, Sheikh R (2023). Branch Target Buffer Organizations. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 240-253. Online publication date: 28-Oct-2023.
https://dl.acm.org/doi/10.1145/3613424.3623774
Guo Z, He Z, Zhang Y, Druschel P, Kaufmann A, Mace J, Flinn J, Seltzer M (2023). Mira: A Program-Behavior-Guided Far Memory System. Proceedings of the 29th Symposium on Operating Systems Principles, pp. 692-708. Online publication date: 23-Oct-2023.
https://dl.acm.org/doi/10.1145/3600006.3613157
Stojkovic J, Liu C, Shahbaz M, Torrellas J, Solihin Y, Heinrich M (2023). μManycore: A Cloud-Native CPU for Tail at Scale. Proceedings of the 50th Annual International Symposium on Computer Architecture, pp. 1-15. Online publication date: 17-Jun-2023.
https://dl.acm.org/doi/10.1145/3579371.3589068
Feliu J, Perais A, Jiménez D, Ros A (2023). Rebasing Microarchitectural Research with Industry Traces. 2023 IEEE International Symposium on Workload Characterization (IISWC), pp. 100-114. Online publication date: 1-Oct-2023.
https://doi.org/10.1109/IISWC59245.2023.00027
Lin W, Qin J, Chen Y, Jin Z, Xu J, Zhang Y, Cai S, Fu L, Chen Y, Chen W (2023). JACO: JAva Code Layout Optimizer Enabling Continuous Optimization without Pausing Application Services. 2023 IEEE International Conference on Cluster Computing (CLUSTER), pp. 295-306. Online publication date: 31-Oct-2023.
https://doi.org/10.1109/CLUSTER52292.2023.00032
Ghahani S, Khadirsharbiyani S, Kotra J, Kandemir M, Kloeckner A, Moreira J (2022). Athena. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, pp. 359-371. Online publication date: 8-Oct-2022.
https://dl.acm.org/doi/10.1145/3559009.3569684
Jamilan S, Khan T, Ayers G, Kasikci B, Litz H, Bromberg Y, Kermarrec A, Kozyrakis C (2022). APT-GET. Proceedings of the Seventeenth European Conference on Computer Systems, pp. 747-764. Online publication date: 28-Mar-2022.
https://dl.acm.org/doi/10.1145/3492321.3519583
