ABSTRACT
The Memory Management Unit (MMU) in modern processors includes a Translation Lookaside Buffer (TLB) that caches recently used Page-Table Entries (PTEs), avoiding redundant page-table walks during address translation. The amount of memory that the TLB can translate without a miss is commonly known as its reach. However, the TLB size, and thus its reach, is limited because the TLB sits on the critical path to the cache memory and must therefore deliver low access latency.
While extensive research has been devoted to reducing TLB pressure, it has generally been assumed that the TLB reach is strictly determined by the number of TLB entries, as if the TLB were a fully associative cache structure. In this work, however, we demonstrate that the number of TLB entries only sets a theoretical upper bound on the TLB reach, and we reveal how the TLB's actual indexing circuitry can reduce the effective reach by 256 KB in some Intel processors compared to their PTE storage capacity.
Moreover, recent security work has shown how adversaries can mount PTE-based cache side-channel attacks by repeatedly forcing the MMU to perform spurious page-table walks, which can be accomplished by exceeding the TLB reach over and over. In Intel's Skylake, for example, the TLB can host up to 1600 PTEs, giving it a reach of 6.25 MB with 4 KB pages. Yet we propose a target-relative TLB eviction strategy that loads only 84 handpicked PTEs into the TLB to evict a target PTE, letting an adversary artificially shrink the TLB reach to just 344 KB.
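The reach arithmetic above, and the way set indexing enables eviction with only a handful of conflicting PTEs, can be illustrated with a small simulation. This is a minimal sketch under assumed parameters (a 128-set, 12-way LRU second-level TLB indexed by the low bits of the virtual page number); the real Skylake index function and hierarchy reverse-engineered for the attack may differ:

```python
from collections import OrderedDict

PAGE_SIZE = 4096      # 4 KB pages
TLB_ENTRIES = 1600    # total PTEs across the Skylake TLB hierarchy (per the abstract)
NUM_SETS = 128        # assumed number of second-level TLB sets
WAYS = 12             # assumed associativity

# Theoretical reach: every entry holds one 4-KB translation.
theoretical_reach = TLB_ENTRIES * PAGE_SIZE   # 6,553,600 bytes = 6.25 MB

class SetAssocTLB:
    """Tiny LRU set-associative TLB model (illustrative, not Skylake-exact)."""
    def __init__(self, sets=NUM_SETS, ways=WAYS):
        self.sets = [OrderedDict() for _ in range(sets)]
        self.ways = ways

    def index(self, vpn):
        # Assumed index function: low VPN bits select the set.
        return vpn % len(self.sets)

    def access(self, vpn):
        s = self.sets[self.index(vpn)]
        if vpn in s:
            s.move_to_end(vpn)        # hit: refresh LRU position
            return True
        if len(s) >= self.ways:
            s.popitem(last=False)     # miss in a full set: evict LRU way
        s[vpn] = None
        return False

tlb = SetAssocTLB()
target = 0x12345
tlb.access(target)                    # install the target PTE
# Target-relative eviction: touch WAYS pages mapping to the target's set.
for k in range(1, WAYS + 1):
    tlb.access(target + k * NUM_SETS)
evicted = target not in tlb.sets[tlb.index(target)]
```

Under these assumptions, touching just `WAYS` conflicting pages (48 KB of memory) evicts the target from its set, far below the 6.25 MB theoretical reach; the abstract's figure of 84 PTEs reflects the full strategy across the real, multi-level TLB hierarchy rather than this single-level model.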
Effective TLB thrashing: unveiling the true short reach of modern TLB designs