ABSTRACT
The Memory Management Unit (MMU) in modern processors includes a Translation Lookaside Buffer (TLB) that caches recently used Page-Table Entries (PTEs), avoiding redundant page-table walks during address translation. The amount of memory that the TLB can translate without a miss is commonly known as its reach. However, the TLB size, and thus its reach, is limited because the TLB sits on the critical path to the cache memory and must therefore deliver low access latency.
While extensive research has been devoted to reducing TLB pressure, it has generally been assumed that the TLB reach is strictly determined by the number of TLB entries, as if the TLB were a fully associative cache structure. In this work, however, we demonstrate that the number of TLB entries only sets a theoretical upper bound on the TLB reach, and we reveal how the TLB's actual indexing circuitry can reduce the effective reach by 256 KB in some Intel processors compared to their PTE storage capacity.
Moreover, recent security work has shown how adversaries can mount PTE-based cache side-channel attacks by repeatedly forcing the MMU to perform spurious page-table walks, which can be accomplished by exceeding the TLB reach over and over. In Intel's Skylake, for example, the TLB can host up to 1600 PTEs, giving it a reach of 6.25 MB with 4 KB pages. Yet we propose a target-relative TLB eviction strategy that loads only 84 handpicked PTEs into the TLB to evict a target PTE, letting an adversary artificially shrink the TLB reach to just 344 KB.
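The reach arithmetic above, and the way set indexing enables eviction with only a handful of conflicting PTEs, can be illustrated with a small simulation. This is a minimal sketch under assumed parameters (a 128-set, 12-way LRU second-level TLB indexed by the low bits of the virtual page number); the real Skylake index function and hierarchy reverse-engineered for the attack may differ:

```python
from collections import OrderedDict

PAGE_SIZE = 4096      # 4 KB pages
TLB_ENTRIES = 1600    # total PTEs across the Skylake TLB hierarchy (per the abstract)
NUM_SETS = 128        # assumed number of second-level TLB sets
WAYS = 12             # assumed associativity

# Theoretical reach: every entry holds one 4-KB translation.
theoretical_reach = TLB_ENTRIES * PAGE_SIZE   # 6,553,600 bytes = 6.25 MB

class SetAssocTLB:
    """Tiny LRU set-associative TLB model (illustrative, not Skylake-exact)."""
    def __init__(self, sets=NUM_SETS, ways=WAYS):
        self.sets = [OrderedDict() for _ in range(sets)]
        self.ways = ways

    def index(self, vpn):
        # Assumed index function: low VPN bits select the set.
        return vpn % len(self.sets)

    def access(self, vpn):
        s = self.sets[self.index(vpn)]
        if vpn in s:
            s.move_to_end(vpn)        # hit: refresh LRU position
            return True
        if len(s) >= self.ways:
            s.popitem(last=False)     # miss in a full set: evict LRU way
        s[vpn] = None
        return False

tlb = SetAssocTLB()
target = 0x12345
tlb.access(target)                    # install the target PTE
# Target-relative eviction: touch WAYS pages mapping to the target's set.
for k in range(1, WAYS + 1):
    tlb.access(target + k * NUM_SETS)
evicted = target not in tlb.sets[tlb.index(target)]
```

Under these assumptions, touching just `WAYS` conflicting pages (48 KB of memory) evicts the target from its set, far below the 6.25 MB theoretical reach; the abstract's figure of 84 PTEs reflects the full strategy across the real, multi-level TLB hierarchy rather than this single-level model.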
Effective TLB thrashing: unveiling the true short reach of modern TLB designs